CN110633464A - Semantic recognition method, device, medium and electronic equipment - Google Patents


Info

Publication number
CN110633464A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810653680.5A
Other languages
Chinese (zh)
Inventor
王颖帅
李晓霞
苗诗雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201810653680.5A priority Critical patent/CN110633464A/en
Publication of CN110633464A publication Critical patent/CN110633464A/en

Abstract

The embodiment of the invention provides a semantic recognition method, which comprises the following steps: performing word segmentation on the acquired short text data, and determining a characteristic lexical item in the short text data; determining the weight of the characteristic terms according to the attributes of the characteristic terms; extracting a target lexical item and generating a target lexical item list according to the weight of the characteristic lexical item; and realizing semantic recognition based on the target term list. According to the technical scheme of the embodiment of the invention, the word segmentation effect of the short text data is improved, the keywords meeting the actual requirements are extracted, the comprehensive weight calculation is carried out on the keywords, the fuzzy intention understanding can be carried out on the user input information by combining a semantic recognition model, so that the corresponding entities are associated, and the recognition accuracy of the fuzzy linguistic data is improved.

Description

Semantic recognition method, device, medium and electronic equipment
Technical Field
The invention relates to the technical field of natural language processing, in particular to a semantic recognition method, a semantic recognition device, a semantic recognition medium and electronic equipment.
Background
With the development of big data and artificial intelligence applications, the amount of unstructured data such as text and voice generated on networks has grown greatly. Faced with this massive data, how to let a computer learn to extract information automatically, as a person does, and thereby improve its information processing capability is an important subject of natural language processing at present. In the e-commerce field, more and more users like to shop online by chatting with an intelligent robot. User utterances are relatively short but mixed with noise, so it is very important that the robot understand the user's semantics correctly and quickly. Against this background, the invention provides an improved short text semantic recognition algorithm based on TextRank4zh, which enables a computer to mine the meaningful and important words in a given text, thereby helping a robot improve its natural language understanding ability.
In the prior art, there are three main methods for keyword extraction: (1) statistical feature extraction based on Term Frequency-Inverse Document Frequency (TF-IDF); (2) keyword extraction based on the Latent Dirichlet Allocation (LDA) document topic model; (3) keyword extraction based on TextRank.
However, the above three prior arts have the following disadvantages:
(1) keyword extraction based on the TF-IDF method is simple; although it considers how often words occur, it ignores words that occur rarely but carry relatively large weight, so the extraction accuracy is not high;
(2) the graph model-based document keyword extraction method treats the words of a document as a network graph, in which each word is a vertex and the connections between words are edges, and the weight of each vertex is propagated along the edges connected to it until the whole network tends to be stable;
(3) the classical TextRank algorithm considers the information of the document; the weight of each extracted candidate keyword is passed to adjacent nodes according to out-degree and is not limited by the amount of text data, but the algorithm initializes the vertex weight of each word to 1, ignores the attributes of the words, and supports Chinese poorly.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present invention and therefore may include information that does not constitute prior art known to a person of ordinary skill in the art.
Disclosure of Invention
The embodiment of the invention aims to provide a semantic recognition method, so as to overcome the problem that the traditional scheme cannot accurately recognize fuzzy semantics at least to a certain extent.
Additional features and advantages of the invention will be set forth in the detailed description which follows, or may be learned by practice of the invention.
According to a first aspect of the embodiments of the present invention, there is provided a semantic recognition method, including:
performing word segmentation on the acquired short text data, and determining a characteristic lexical item in the short text data;
determining the weight of the characteristic terms according to the attributes of the characteristic terms;
extracting a target lexical item and generating a target lexical item list according to the weight of the characteristic lexical item;
and realizing semantic recognition based on the target term list.
In some embodiments of the present invention, based on the foregoing solution, before performing word segmentation on the acquired short text data, the method further includes:
acquiring an entity description database table and a corpus, wherein the entity database table at least comprises the following fields: entity title, entity function description, entity class and entity brand;
extracting keywords from the entity description database table based on a preset attribute description rule to obtain entity description keywords;
filtering non-modified corpora in the corpora based on a preset regular expression to obtain modified corpora;
and generating a short text database based on the entity description keywords and the modified linguistic data.
In some embodiments of the present invention, based on the foregoing solution, the generating a short text database based on the entity description keyword and the modified corpus includes:
associating the entity description key words with the modification linguistic data to determine the modification linguistic data corresponding to the entity description key words and generate short text data;
and generating a short text database based on each short text data.
In some embodiments of the present invention, based on the foregoing scheme, before determining the weight of the feature term based on the attribute of the feature term, the method further includes:
filtering out the stop words and the low-frequency terms among the feature terms;
and performing relevance analysis on the feature terms obtained after filtering, merging the feature terms with the similarity greater than or equal to a preset threshold value, and obtaining the merged feature terms.
In some embodiments of the present invention, based on the foregoing scheme, the determining, based on the attribute of the feature term, a weight of the feature term includes:
performing TF-IDF calculation on the feature terms to obtain TF-IDF values of the feature terms;
determining the TF-IDF value of each feature term as a first weight parameter of the feature term.
In some embodiments of the present invention, based on the foregoing scheme, the method further comprises:
according to the position of the characteristic terms in the short text data, and/or the length of the characteristic terms, and/or the parts of speech of the characteristic terms, giving a weight with a preset proportion to each characteristic term;
determining the assigned weight as a second weight parameter of the feature term.
In some embodiments of the present invention, based on the foregoing scheme, the method further comprises:
performing loop iteration on each feature term based on the first weight and the second weight of each feature term, and calculating the ranking score of each feature term;
and determining the characteristic terms with the sorting scores larger than a preset sorting threshold value as target terms, and generating a target term list.
In some embodiments of the present invention, based on the foregoing solution, the implementing semantic recognition based on the target term list includes:
acquiring data to be identified;
associating the data to be identified with the target term list to determine an entity corresponding to the data to be identified;
and determining the associated entity as a semantic recognition result of the data to be recognized.
According to a second aspect of the embodiments of the present invention, there is provided a semantic recognition apparatus including a word segmentation module, a weighting module, an ordering module and a recognition module, wherein:
the word segmentation module is used for segmenting the acquired short text data and determining the characteristic lexical items in the short text data;
the weighting module is used for determining the weight of the characteristic terms based on the attributes of the characteristic terms;
the ordering module is used for extracting a target lexical item and generating a target lexical item list according to the weight of the characteristic lexical item;
and the recognition module is used for realizing semantic recognition based on the target term list.
In some embodiments of the present invention, based on the foregoing solution, the apparatus further includes:
a construction module, configured to: acquire an entity description database table and a corpus, wherein the entity description database table at least comprises the following fields: entity title, entity function description, entity class and entity brand; extract keywords from the entity description database table based on a preset attribute description rule to obtain entity description keywords; filter non-modified corpora out of the corpus based on a preset regular expression to obtain modified corpora; and generate a short text database based on the entity description keywords and the modified corpora.
In some embodiments of the present invention, based on the foregoing scheme, the construction module is specifically configured to associate the entity description keywords with the modified corpora, determine the modified corpora corresponding to the entity description keywords, and generate short text data; and to generate a short text database based on each piece of short text data.
In some embodiments of the present invention, based on the foregoing solution, the apparatus further includes:
a merging module, configured to filter out the stop words and low-frequency terms among the feature terms; and to perform relevance analysis on the feature terms obtained after filtering and merge the feature terms whose similarity is greater than or equal to a preset threshold value, obtaining the merged feature terms.
In some embodiments of the present invention, based on the foregoing scheme, the weighting module is specifically configured to perform TF-IDF calculation on the feature terms to obtain a TF-IDF value of each feature term; determining the TF-IDF value of each feature term as a first weight parameter of the feature term.
In some embodiments of the present invention, based on the foregoing solution, the weighting module is further configured to assign a preset proportion of weight to each feature term according to a position of the feature term in the short text data, and/or a length of the feature term, and/or a part of speech of the feature term; determining the assigned weight as a second weight parameter of the feature term.
In some embodiments of the present invention, based on the foregoing scheme, the weighting module is further configured to perform loop iteration on each feature term based on the first weight and the second weight of each feature term, and calculate a ranking score of each feature term; and determining the characteristic terms with the sorting scores larger than a preset sorting threshold value as target terms, and generating a target term list.
In some embodiments of the present invention, based on the foregoing scheme, the recognition module is specifically configured to acquire data to be identified; associate the data to be identified with the target term list to determine an entity corresponding to the data to be identified; and determine the associated entity as a semantic recognition result of the data to be identified.
According to a third aspect of embodiments of the present invention, there is provided a computer-readable medium, on which a computer program is stored, which when executed by a processor, implements the semantic recognition method as described in the first aspect of the embodiments above.
According to a fourth aspect of embodiments of the present invention, there is provided an electronic apparatus, including: one or more processors; storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the semantic recognition method as described in the first aspect of the embodiments above.
The technical scheme provided by the embodiment of the invention has the following beneficial effects:
in the technical solutions provided by some embodiments of the present invention, the feature terms in the short text data are determined by performing word segmentation on the obtained short text data; determining the weight of the characteristic terms according to the attributes of the characteristic terms; extracting a target lexical item and generating a target lexical item list according to the weight of the characteristic lexical item; and realizing semantic recognition based on the target term list. According to the technical scheme of the embodiment of the invention, the word segmentation effect of the short text data is improved, the keywords meeting the actual requirements are extracted, the comprehensive weight calculation is carried out on the keywords, the fuzzy intention understanding can be carried out on the user input information by combining a semantic recognition model, so that the corresponding entities are associated, and the recognition accuracy of the fuzzy linguistic data is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:
Fig. 1 schematically shows a flow chart of a method of semantic recognition according to an embodiment of the invention.
Fig. 2 schematically shows a flow diagram for generating a short text database according to one embodiment of the invention.
Fig. 3 schematically shows a flow diagram of feature term merging according to an embodiment of the invention.
Fig. 4 schematically shows a flow chart for extracting target terms and generating a target term list according to an embodiment of the present invention.
Fig. 5 schematically shows a block diagram of a semantic recognition apparatus according to an embodiment of the present invention.
Fig. 6 illustrates a schematic structural diagram of a computer system suitable for use with the electronic device to implement an embodiment of the invention.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations or operations have not been shown or described in detail to avoid obscuring aspects of the invention.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
Fig. 1 schematically shows a flow chart of a method of semantic recognition according to an embodiment of the invention.
Referring to fig. 1, a semantic recognition method according to an embodiment of the present invention includes the following steps:
step S110, performing word segmentation on the acquired short text data, and determining a characteristic lexical item in the short text data;
step S120, determining the weight of the feature terms according to the attributes of the feature terms;
step S130, extracting a target lexical item and generating a target lexical item list according to the weight of the characteristic lexical item;
and step S140, realizing semantic recognition based on the target term list.
The technical scheme of the embodiment of the invention determines the characteristic lexical item in the short text data by segmenting the acquired short text data; determining the weight of the feature terms according to the attributes of the feature terms; extracting a target lexical item and generating a target lexical item list according to the weight of the characteristic lexical item; and realizing semantic recognition based on the target term list. According to the technical scheme of the embodiment of the invention, the word segmentation effect of the short text data is improved, the keywords meeting the actual requirements are extracted, the comprehensive weight calculation is carried out on the keywords, the fuzzy intention understanding can be carried out on the user input information by combining a semantic recognition model, so that the corresponding entities are associated, and the recognition accuracy of the fuzzy linguistic data is improved.
Implementation details of the various steps shown in FIG. 1 are set forth below:
in step S110, the obtained short text data is segmented, and feature terms in the short text data are determined.
In an embodiment of the present invention, based on the foregoing scheme, before performing word segmentation on short text data, a short text database is generated, as shown in fig. 2, including the following steps:
step S210, acquiring an entity description database table and a corpus;
in one embodiment of the invention, the entity description database table may be derived from an entity wide table comprising the following: entity title, entity functional property description, entity brand and entity category, and the entity can be a commodity.
In an embodiment of the present invention, based on the foregoing solution, users tend to talk with the intelligent assistant (machine semantic recognition) about a fixed set of commodity categories, so the material is filtered in advance to pick out those categories, for example: mobile phones, clothing, cosmetics, etc.
Step S220, extracting keywords from an entity description database table based on preset attribute description rules to obtain entity description keywords;
in an embodiment of the invention, based on the scheme, the encyclopedic general knowledge map of the family of.
Step S230, filtering the non-modified corpora in the corpora based on a preset regular expression to obtain modified corpora;
in one embodiment of the present invention, the first sentence of each dialog between the user and the intelligent assistant (machine semantic recognition) is used as the corpus, because the first sentence carries a large weight when the intelligent assistant recognizes the input corpus.
In an embodiment of the present invention, based on the foregoing scheme, corpora in the user input that carry no information are removed by regular expression matching, and the informative content of the remaining corpora is output, for example: product words, brand words, modifiers, etc., as shown in Table 1 below.
User's first sentence corpus | Product | Brand | Modifier word
Buying a good garment steamer will need to be more expensive | Hanging ironing machine | N/A | Low price
Three-piece fashion for Huiman I to return goods | Three-piece suit | Moyanman (Moliman) | N/A
Electric wind for returning goods Fan (Refresh Fan) | Electric fan | Beauty treatment | N/A
How can the small refrigerator bought by I be in contact with your for maintenance when the small refrigerator is broken | Small refrigerator | N/A | Run out
If the screen of the Samsung mobile phone is damaged, some mobile phones are replaced | Mobile phone screen | Samsung | Run out
Your stainless steel quality is too poor | Stainless steel | N/A | Poor quality
Leather shoes with carved patterns | Leather shoes | All-grass of King-Li Lai | Carved pattern
Summer new Bluetooth sound box | Bluetooth sound box | Summer medicine | N/A
Do not need to return goods | N/A | N/A | N/A
TABLE 1
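The regular-expression filtering and tagging illustrated by Table 1 can be sketched roughly as follows. This is a minimal illustration and not the patent's actual rules: the `NO_INFO` pattern, the `PRODUCTS` and `MODIFIERS` lexicons and both function names are hypothetical, and English strings stand in for the Chinese corpora.

```python
import re

# Hypothetical pattern for utterances that carry no product, brand or modifier information.
NO_INFO = re.compile(r"^(hi|hello|thanks|thank you|ok|are you there)[\s.!?]*$", re.IGNORECASE)

# Hypothetical lexicons; a real system would derive them from the entity description table.
PRODUCTS = ["garment steamer", "small refrigerator", "leather shoes"]
MODIFIERS = ["cheap", "broken", "carved"]


def filter_corpus(utterances):
    """Keep only the utterances that do not match the no-information pattern."""
    return [u for u in utterances if not NO_INFO.match(u.strip())]


def tag_utterance(utterance):
    """Return the first product word and modifier found in an utterance (None if absent)."""
    text = utterance.lower()
    product = next((p for p in PRODUCTS if p in text), None)
    modifier = next((m for m in MODIFIERS if m in text), None)
    return {"product": product, "modifier": modifier}
```

An utterance for which every field comes back as `None` corresponds to the last row of Table 1, where no product, brand or modifier could be extracted.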
Step S240, generating a short text database based on the entity description key words and the modification linguistic data.
In one embodiment of the invention, the entity description keywords are associated with the modification corpus to determine the modification corpus corresponding to the entity description keywords, and short text data is generated; and generating a short text database based on each short text data.
In an embodiment of the present invention, based on the foregoing solution, the corpus extracted by the intelligent assistant and the extracted description keywords together form a short text database, and the database enriches the underlying corpus to be preprocessed subsequently.
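As a rough sketch of step S240, the association between entity description keywords and modified corpora might look like the following. The substring match used as the association rule, the record layout and the name `build_short_text_db` are assumptions made for illustration; the patent does not specify them.

```python
def build_short_text_db(entity_keywords, modified_corpora):
    """Associate each entity's description keywords with the corpora that mention them.

    entity_keywords: dict mapping an entity id to its list of description keywords.
    modified_corpora: list of user utterances that survived the regular-expression filter.
    Returns a list of short-text records, one per (entity, utterance) match.
    """
    database = []
    for entity_id, keywords in entity_keywords.items():
        for sentence in modified_corpora:
            lowered = sentence.lower()
            # Assumed association rule: a keyword is associated if it occurs in the utterance.
            matched = [k for k in keywords if k in lowered]
            if matched:
                database.append({"entity": entity_id, "keywords": matched, "text": sentence})
    return database
```

Each record keeps the entity identifier next to the text, so that later steps can trace an extracted target term back to an entity.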
In step S120, the weight of the feature term is determined according to the attribute of the feature term.
In one embodiment of the present invention, based on the foregoing scheme, the short text data needs to be preprocessed before the weight of the feature terms is determined: stop words and low-frequency terms among the feature terms are filtered out; relevance analysis is performed on the feature terms obtained after filtering, and the feature terms whose similarity is greater than or equal to a preset threshold value are merged to obtain merged feature terms. As shown in Fig. 3, this includes the following steps:
step S310, performing word segmentation on the short text data to obtain characteristic terms;
in one embodiment of the invention, a user-defined product word bank, a brand word bank and a book title special word bank can be added, so that the word segmentation result is closer to the related service.
Step S320, filtering the feature terms;
in an embodiment of the invention, each piece of short text data in the short text database is segmented to obtain the feature terms of a sentence. In practical application, because the feature dimensionality is high, the extracted keywords contain many irrelevant feature terms. Therefore, on the one hand, the semantic recognition method provided by the embodiment of the invention matches the extracted keywords against a preset stop-word list, removing the irrelevant feature terms and filtering out sensitive words at the same time, where the stop-word list can be customized to meet practical requirements; on the other hand, the low-frequency terms among the extracted keywords are deleted so as to reduce the dimensionality of the feature space and facilitate subsequent processing.
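The two filters of step S320 can be sketched as below, assuming the short texts have already been segmented into token lists. The function name `filter_terms` and the `min_count` cut-off are illustrative choices; the patent does not state what frequency counts as low.

```python
from collections import Counter


def filter_terms(segmented_texts, stop_words, min_count=2):
    """Drop stop words and terms that occur fewer than min_count times in the whole database."""
    # Corpus-wide frequency of every term.
    counts = Counter(term for text in segmented_texts for term in text)
    return [
        [term for term in text if term not in stop_words and counts[term] >= min_count]
        for text in segmented_texts
    ]
```

Sensitive words can be handled by the same mechanism by adding them to `stop_words`.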
Step S330, carrying out correlation analysis on the filtered feature terms;
in an embodiment of the present invention, feature relevance analysis is adopted to obtain feature vectors of filtered feature terms, similarity between the feature terms is calculated and used as a weight of an edge between the feature terms, and if there is no similarity between two feature terms, it means that there is no corresponding edge between the two feature terms.
Step S340, merging the feature terms based on the correlation analysis.
In an embodiment of the present invention, based on the foregoing scheme, synonyms, or terms determined by the correlation analysis to have the same attribute, are merged to refine the feature terms.
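Steps S330 and S340 can be illustrated with cosine similarity over feature vectors, followed by grouping of every pair whose similarity reaches the threshold. Cosine similarity, the union-find grouping and the default threshold of 0.9 are assumptions made for this sketch; the patent only requires a similarity measure and a preset threshold.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_terms(vectors, threshold=0.9):
    """Group terms whose pairwise similarity is >= threshold.

    vectors: dict mapping a term to its feature vector.
    Returns a sorted list of groups, each group a sorted list of terms.
    """
    terms = sorted(vectors)
    parent = {t: t for t in terms}

    def find(t):
        # Follow parent links to the group representative, shortening the path on the way.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                parent[find(b)] = find(a)

    groups = {}
    for t in terms:
        groups.setdefault(find(t), []).append(t)
    return sorted(groups.values())
```

The pairwise similarities computed here are the same quantities the description uses as edge weights between feature terms.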
Step S130, extracting a target lexical item and generating a target lexical item list according to the weight of the characteristic lexical item;
in one embodiment of the invention, TF-IDF calculation is carried out on the characteristic terms to obtain TF-IDF values of the characteristic terms; determining the TF-IDF value of each characteristic lexical item as a first weight parameter of the characteristic lexical item; according to the position of the characteristic lexical item in the short text data, and/or the length of the characteristic lexical item, and/or the part of speech of the characteristic lexical item, giving a weight with a preset proportion to each characteristic lexical item; determining the given weight as a second weight parameter of the feature term; performing loop iteration on each feature term based on the first weight and the second weight of each feature term, and calculating the ranking score of each feature term; and determining the characteristic terms with the sorting scores larger than a preset sorting threshold value as target terms, and generating a target term list.
In an embodiment of the present invention, at least four attributes of each feature term, namely its TF-IDF value, term position, term length and part of speech, are evaluated comprehensively, and the comprehensive weight value obtained by the evaluation is determined as the weight of the feature term. Specifically, referring to Fig. 4, assigning a weight to the feature term includes:
(1) TF-IDF value
In one embodiment of the present invention, TF is the frequency with which a word occurs in a short text, defined as
$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$
where $n_{i,j}$ is the number of times the word $w_i$ appears in the current short text $d_j$, and the denominator is the total number of occurrences of all words in that short text. The IDF value embodies the general importance of a word and is defined as
$$\mathrm{IDF}_{i} = \log \frac{|D|}{1 + |\{\, j : w_i \in d_j \,\}|}$$
where $|D|$ is the total number of short texts in the corpus and $|\{\, j : w_i \in d_j \,\}|$ is the number of short texts containing the word $w_i$. If the word does not occur in the corpus, this count would be 0 and make the denominator 0, so 1 is added.
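A minimal sketch of the first weight parameter follows, assuming TF is the within-text relative frequency and IDF is the logarithm of the number of short texts divided by one plus the number of short texts containing the word. The function name `tf_idf` and the use of the natural logarithm are choices made for this illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute TF-IDF per short text.

    docs: list of token lists, one per short text.
    Returns a list of dicts mapping each term of a text to its TF-IDF value.
    """
    n_docs = len(docs)
    # Number of short texts that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    result = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())
        result.append({
            term: (count / total) * math.log(n_docs / (1 + doc_freq[term]))
            for term, count in counts.items()
        })
    return result
```

With the added 1 in the denominator, a term that appears in every short text receives a slightly negative IDF, which pushes it toward the bottom of the ranking.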
(2) Term location
In one embodiment of the invention, the position at which a word appears in a short text of the short text database is divided into three classes, namely the beginning, the middle and the end of the short text, and different weights are given according to the importance of each position;
(3) Length of term
In one embodiment of the invention, based on statistics of feature term lengths, the length of a feature term is divided into four levels, namely 2, 3, 4 and more than 4, and different weights are given to the feature terms according to the importance of each level;
(4) Part of speech
In one embodiment of the present invention, based on statistics of the parts of speech of the feature terms, the parts of speech are divided into nouns, short phrases, adjectival nouns, verbal nouns and the like, and different weights are given according to their degrees of importance.
In one embodiment of the invention, part-of-speech tags may be added to a custom thesaurus.
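The second weight parameter combines position, length and part of speech in preset proportions. The sketch below shows one way to do that; every numeric weight, the `ratios` tuple and the tag names (`n`, `l`, `an`, `vn`, chosen to resemble common Chinese part-of-speech tag sets) are hypothetical values picked for illustration, not values given in the patent.

```python
# Hypothetical weights for the three position classes.
POSITION_WEIGHT = {"start": 1.0, "middle": 0.6, "end": 0.8}
# Hypothetical weights per part-of-speech tag: noun, phrase, adjectival noun, verbal noun.
POS_WEIGHT = {"n": 1.0, "l": 0.6, "an": 0.8, "vn": 0.8}


def position_of(index, n_terms):
    """Classify a term's position in its short text as start, middle or end."""
    if index == 0:
        return "start"
    if index == n_terms - 1:
        return "end"
    return "middle"


def length_weight(term):
    """Map term length to one of four levels: 2 or fewer, 3, 4, more than 4 characters."""
    n = len(term)
    if n <= 2:
        return 0.6
    if n == 3:
        return 0.8
    if n == 4:
        return 1.0
    return 0.7


def second_weight(term, index, n_terms, pos_tag, ratios=(0.4, 0.3, 0.3)):
    """Combine position, length and part-of-speech weights in preset proportions."""
    r_pos, r_len, r_tag = ratios
    return (
        r_pos * POSITION_WEIGHT[position_of(index, n_terms)]
        + r_len * length_weight(term)
        + r_tag * POS_WEIGHT.get(pos_tag, 0.3)  # unknown tags get a low default
    )
```

Because the description allows position, length and part of speech to be used "and/or", setting one entry of `ratios` to zero drops that attribute from the combination.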
In an embodiment of the present invention, based on the foregoing scheme, a ranking score is calculated for each feature term through a comprehensive weight function, and the keywords in the short text data are extracted. Specifically, a relational graph may be constructed using the graph-based TextRank4zh ranking algorithm, and the importance of each vertex is determined by iterating over the information of the whole graph. The parameter settings may filter by specified parts of speech, and the parts of speech used for this filtering may be stored in the allow_speech_tags variable. For example, for a commodity described as "water-moistening and skin-tendering, thorough-moistening and moisture-preserving and non-drying", the keywords extracted by the semantic recognition method provided by the embodiment of the invention are "water-moistening" and "skin-tendering", each with a weight of 0.5.
In one embodiment of the present invention, referring to fig. 4, extracting the target terms and generating the target term list includes the following steps:
step S410, short text data are preprocessed;
step S420, counting TF-IDF values of each characteristic lexical item;
step S430, counting the position, length and part of speech of the characteristic terms;
step S440, calculating comprehensive weight;
step S450, constructing a TextRank graph model;
step S460, calculating the score of each feature term;
step S470, judging whether the scores of the feature terms are converged; if not, returning to step S460; if converging, execute step S480;
step S480, ordering the feature terms according to the scores of the feature terms;
and step S490, outputting the target terms and generating the target term list.
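Steps S450 to S480 can be sketched as a TextRank-style iteration over a co-occurrence graph. This is a minimal sketch under stated assumptions: the specification does not say how the comprehensive weight enters the iteration, so here it is assumed to bias the random-jump term, and the function name, window size, damping factor and tolerance are all hypothetical.

```python
from collections import defaultdict


def rank_terms(terms, weights, window=2, damping=0.85, tol=1e-6, max_iter=100):
    """Rank feature terms with a TextRank-style iteration.

    terms   -- list of feature terms in order of appearance (after S410)
    weights -- comprehensive weight per term (assumed output of S420 to S440)
    """
    # S450: link terms that co-occur within `window` positions.
    neighbors = defaultdict(set)
    for i, a in enumerate(terms):
        for b in terms[i + 1 : i + window + 1]:
            if a != b:
                neighbors[a].add(b)
                neighbors[b].add(a)

    vocab = sorted(set(terms))
    total = sum(weights.get(t, 0.0) for t in vocab) or 1.0
    # Assumption: normalised comprehensive weights bias the random-jump term.
    prior = {t: weights.get(t, 0.0) / total for t in vocab}
    scores = dict(prior)

    for _ in range(max_iter):
        new_scores = {}
        for t in vocab:
            # S460: each neighbour shares its score equally among its links.
            incoming = sum(scores[n] / len(neighbors[n]) for n in neighbors[t])
            new_scores[t] = (1 - damping) * prior[t] + damping * incoming
        # S470: stop once no score moves by more than `tol`.
        converged = all(abs(new_scores[t] - scores[t]) < tol for t in vocab)
        scores = new_scores
        if converged:
            break

    # S480: highest score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

The target term list of step S490 would then be the leading entries of the returned list, for example those whose score exceeds a preset threshold.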
With continued reference to fig. 1, step S140 implements semantic recognition based on the list of target terms.
In one embodiment of the invention, the semantic recognition may be implemented by acquiring data to be recognized; associating the data to be identified with the target term list, and determining an entity corresponding to the data to be identified; and determining the associated entity as a semantic recognition result of the data to be recognized.
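The association between the data to be recognized and the target term list can be illustrated with a simple lookup. The index structure (target term mapped to entities and weights), the additive scoring and the entity names are assumptions made for this sketch; the specification does not fix them.

```python
def recognize(query_terms, term_index):
    """Return the entity whose target terms best match the query terms.

    term_index maps a target term to {entity: weight}. Both this structure
    and the additive scoring are hypothetical.
    """
    totals = {}
    for term in query_terms:
        for entity, weight in term_index.get(term, {}).items():
            totals[entity] = totals.get(entity, 0.0) + weight
    if not totals:
        # No target term matched, so no entity can be associated.
        return None
    return max(totals, key=totals.get)


# Hypothetical index built from the keyword example given earlier.
index = {
    "water-moistening": {"face_cream_A": 0.5, "mask_B": 0.2},
    "skin-tendering": {"face_cream_A": 0.5},
}
result = recognize(["water-moistening", "skin-tendering"], index)
```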
In one embodiment of the invention, the executing agent that implements semantic recognition may be a smart assistant. The smart assistant can understand the product words, brand words and modifiers input by the user, then call the search interface to run a query, and return the results and display them to the user. If product words and brand words can be extracted, the shopping intention of the user is, in general, relatively clear. When the corpus input by the user contains only modifiers, and no product words or brand words can be extracted, the corresponding entities can be associated through the target term list provided by the embodiment of the invention. In some practical application cases, the description of the commodity attributes may be very long; in that case, the method provided by the embodiment of the present invention is used to extract keywords, and the commodity attribute description and the extracted keywords can be associated through the knowledge graph to obtain recommendable commodities.
In an embodiment of the present invention, the target term list provided in the embodiment of the present invention may be supplemented by a product lexicon and a brand lexicon after being processed by a certain policy, and may also be combined with a manually labeled modifier to construct a commonly used description lexicon under a specific category, so as to construct an NLP bottom-layer corpus.
The following describes embodiments of the apparatus of the present invention, which can be used to perform the above semantic recognition method of the present invention.
Fig. 5 schematically shows a block diagram of a semantic recognition apparatus according to an embodiment of the present invention.
Referring to fig. 5, a semantic recognition apparatus 500 according to an embodiment of the present invention includes: a word segmentation module 501, a weight module 502, a sorting module 503 and a recognition module 504; wherein:
a word segmentation module 501, configured to segment words of the obtained short text data and determine feature terms in the short text data;
a weight module 502, configured to determine a weight of the feature term based on an attribute of the feature term;
the sorting module 503 is configured to extract a target term according to the weight of the feature term and generate a target term list;
and the recognition module 504 is used for realizing semantic recognition based on the target term list.
In some embodiments of the present invention, based on the foregoing solution, the apparatus further includes:
a building module 505, configured to obtain an entity description database table and a corpus, where the entity description database table at least includes the following fields: entity title, entity function description, entity class and entity brand; extracting keywords from the entity description database table based on a preset attribute description rule to obtain entity description keywords; filtering non-modified corpora in the corpora based on a preset regular expression to obtain modified corpora; and generating a short text database based on the entity description keywords and the modified corpora.
In some embodiments of the present invention, based on the foregoing scheme, the building module 505 is specifically configured to associate the entity description keywords with the modified corpora, determine the modified corpora corresponding to the entity description keywords, and generate short text data; and to generate a short text database based on the short text data.
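The construction of the short text database can be sketched as a regular-expression filter followed by a keyword association. The pattern below, the substring-based association and the record layout are all assumptions; the preset regular expression and attribute description rule are not given in the specification.

```python
import re

# Hypothetical rule: lines made only of digits, punctuation or whitespace
# are treated as non-modified corpora and discarded.
NON_MODIFIER = re.compile(r"^[\d\W_]*$")


def build_short_text_db(entity_keywords, corpus):
    """Associate entity description keywords with the filtered corpus lines."""
    modifiers = [line.strip() for line in corpus if not NON_MODIFIER.match(line)]
    db = []
    for keyword in entity_keywords:
        for text in modifiers:
            # Assumption: a keyword is associated with a line that contains it.
            if keyword in text:
                db.append({"keyword": keyword, "short_text": text})
    return db
```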
In some embodiments of the present invention, based on the foregoing solution, the apparatus further includes:
a merging module 506, configured to filter out deactivated terms and low-frequency terms in the feature terms; and performing relevance analysis on the feature terms obtained after filtering, merging the feature terms with the similarity greater than or equal to a preset threshold value, and obtaining the merged feature terms.
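The filtering and merging performed by the merging module can be illustrated as follows. The specification does not define the relevance analysis, so this sketch substitutes a character-level similarity from the Python standard library (`difflib.SequenceMatcher`); the function name, the minimum count and the threshold are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def filter_and_merge(terms, stop_words, min_count=2, threshold=0.8):
    """Drop stop words and low-frequency terms, then merge similar terms."""
    counts = Counter(terms)
    kept = [t for t in counts if t not in stop_words and counts[t] >= min_count]
    # Visit frequent terms first so each group is represented by its
    # most common member.
    kept.sort(key=lambda t: (-counts[t], t))
    merged = {}
    for term in kept:
        for rep in merged:
            if SequenceMatcher(None, term, rep).ratio() >= threshold:
                # Similar enough: fold the count into the representative.
                merged[rep] += counts[term]
                break
        else:
            merged[term] = counts[term]
    return merged
```

A semantic similarity (for example, one based on word vectors) could replace the string similarity without changing the surrounding logic.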
In some embodiments of the present invention, based on the foregoing scheme, the weighting module 502 is specifically configured to perform TF-IDF calculation on the feature terms to obtain TF-IDF values of the feature terms; and determining the TF-IDF value of each characteristic lexical item as a first weight parameter of the characteristic lexical item.
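As a rough illustration of the first weight parameter, the TF-IDF value of each feature term can be computed as below. The unsmoothed variant (term frequency times log(N / document frequency)) is an assumption, since the specification does not state which TF-IDF formula is used.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF per document.

    documents -- list of term lists, one per short text
    Returns one {term: tf-idf value} dict per document.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document it appears in.
        doc_freq.update(set(doc))
    results = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc) or 1
        results.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return results
```

With this variant, a term that appears in every short text receives a value of zero, which is one reason smoothed variants are often preferred in practice.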
In some embodiments of the present invention, based on the foregoing scheme, the weighting module 502 is further configured to assign a preset proportion of weight to each feature term according to the position of the feature term in the short text data, and/or the length of the feature term, and/or the part of speech of the feature term; and to determine the assigned weight as a second weight parameter of the feature term.
In some embodiments of the present invention, based on the foregoing scheme, the weighting module 502 is further configured to perform loop iteration on each feature term based on the first weight and the second weight of each feature term, and calculate a ranking score of each feature term; and determining the characteristic terms with the sorting scores larger than a preset sorting threshold value as target terms, and generating a target term list.
In some embodiments of the present invention, based on the foregoing scheme, the identifying module 504 is specifically configured to obtain data to be identified; associating the data to be identified with the target term list, and determining an entity corresponding to the data to be identified; and determining the associated entity as a semantic recognition result of the data to be recognized.
Since the functional modules of the semantic recognition apparatus of the exemplary embodiment of the present invention correspond to the steps of the exemplary embodiment of the semantic recognition method described above, for details not disclosed in the apparatus embodiments, reference is made to the embodiments of the semantic recognition method described above.
Referring now to FIG. 6, shown is a block diagram of a computer system 600 suitable for use with the electronic device implementing an embodiment of the present invention. The computer system 600 of the electronic device shown in fig. 6 is only an example, and should not bring any limitation to the function and the scope of the use of the embodiments of the present invention.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU)601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for system operation are also stored. The CPU601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 610 as necessary, so that a computer program read out therefrom is installed in the storage section 608 as necessary.
In particular, according to an embodiment of the present invention, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the invention include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The above-described functions defined in the system of the present application are executed when the computer program is executed by the Central Processing Unit (CPU) 601.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present invention may be implemented by software, or may be implemented by hardware, and the described units may also be disposed in a processor. Wherein the names of the elements do not in some way constitute a limitation on the elements themselves.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by an electronic device, cause the electronic device to implement the semantic recognition method as in the above embodiments.
For example, the electronic device may implement as shown in fig. 1: step S110, performing word segmentation on the acquired short text data, and determining a characteristic lexical item in the short text data; step S120, determining the weight of the feature terms according to the attributes of the feature terms; step S130, extracting a target lexical item and generating a target lexical item list according to the weight of the characteristic lexical item; and step S140, realizing semantic recognition based on the target term list.
As another example, the electronic device may implement the steps shown in FIG. 2.
As another example, the electronic device may implement the steps shown in fig. 3.
As another example, the electronic device may implement the steps shown in fig. 4.
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the invention, the features and functionality of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiment of the present invention can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which can be a personal computer, a server, a touch terminal, or a network device, etc.) to execute the method according to the embodiment of the present invention.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (11)

1. A method of semantic identification, comprising:
performing word segmentation on the acquired short text data, and determining a characteristic lexical item in the short text data;
determining the weight of the characteristic terms according to the attributes of the characteristic terms;
extracting a target lexical item and generating a target lexical item list according to the weight of the characteristic lexical item;
and realizing semantic recognition based on the target term list.
2. The semantic recognition method according to claim 1, wherein before the segmenting the acquired short text data, the method further comprises:
acquiring an entity description database table and a corpus, wherein the entity database table at least comprises the following fields: entity title, entity function description, entity class and entity brand;
extracting keywords from the entity description database table based on a preset attribute description rule to obtain entity description keywords;
filtering non-modified corpora in the corpora based on a preset regular expression to obtain modified corpora;
and generating a short text database based on the entity description keywords and the modified linguistic data.
3. The semantic recognition method according to claim 2, wherein the generating a short text database based on the entity description keywords and the modified corpus comprises:
associating the entity description key words with the modification linguistic data to determine the modification linguistic data corresponding to the entity description key words and generate short text data;
and generating a short text database based on each short text data.
4. The semantic recognition method according to claim 1, wherein before determining the weight of the feature term based on the attribute of the feature term, the method further comprises:
filtering out the deactivated lexical items and the low-frequency lexical items in the characteristic lexical items;
and performing relevance analysis on the feature terms obtained after filtering, merging the feature terms with the similarity greater than or equal to a preset threshold value, and obtaining the merged feature terms.
5. The semantic recognition method according to claim 1, wherein the determining the weight of the feature term based on the attribute of the feature term comprises:
performing TF-IDF calculation on the feature terms to obtain TF-IDF values of the feature terms;
determining the TF-IDF value of each feature term as a first weight parameter of the feature term.
6. The semantic recognition method of claim 5, further comprising:
according to the position of the characteristic terms in the short text data, and/or the length of the characteristic terms, and/or the parts of speech of the characteristic terms, giving a weight with a preset proportion to each characteristic term;
determining the assigned weight as a second weight parameter of the feature term.
7. The semantic recognition method according to claim 6 or 7, characterized in that the method further comprises:
performing loop iteration on each feature term based on the first weight and the second weight of each feature term, and calculating the ranking score of each feature term;
and determining the characteristic terms with the sorting scores larger than a preset sorting threshold value as target terms, and generating a target term list.
8. The semantic recognition method according to claim 1, wherein the performing semantic recognition based on the target term list comprises:
acquiring data to be identified;
associating the data to be identified with the target term list to determine an entity corresponding to the data to be identified;
and determining the associated entity as a semantic recognition result of the data to be recognized.
9. A semantic recognition apparatus, the apparatus comprising: a word segmentation module, a weight module, a sorting module and a recognition module; wherein:
the word segmentation module is used for segmenting the acquired short text data and determining the characteristic lexical items in the short text data;
the weight module is used for determining the weight of the characteristic terms based on the attributes of the characteristic terms;
the sorting module is used for extracting a target lexical item and generating a target lexical item list according to the weight of the characteristic lexical item;
and the recognition module is used for realizing semantic recognition based on the target term list.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the data processing method of any one of claims 1 to 8.
11. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out a data processing method according to any one of claims 1 to 8.
CN201810653680.5A 2018-06-22 2018-06-22 Semantic recognition method, device, medium and electronic equipment Pending CN110633464A (en)

Publications (1)

Publication Number Publication Date
CN110633464A true CN110633464A (en) 2019-12-31

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination