CN114416998A - Text label identification method and device, electronic equipment and storage medium - Google Patents

Text label identification method and device, electronic equipment and storage medium

Info

Publication number
CN114416998A
Authority
CN
China
Prior art keywords
text
entity
geographic
keyword
keywords
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210082518.9A
Other languages
Chinese (zh)
Inventor
宋威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Ping An Smart Healthcare Technology Co ltd
Original Assignee
Ping An International Smart City Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An International Smart City Technology Co Ltd filed Critical Ping An International Smart City Technology Co Ltd
Priority to CN202210082518.9A priority Critical patent/CN114416998A/en
Publication of CN114416998A publication Critical patent/CN114416998A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/387 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using geographical or spatial information, e.g. location

Abstract

The application is applicable to the technical field of big data, and provides a text label identification method, a text label identification device, electronic equipment and a storage medium, wherein the method comprises the following steps: in response to a tag configuration request for a target text, determining candidate geographic keywords contained in the target text through a preset entity identification model; generating a feature vector corresponding to each candidate geographic keyword based on the text interaction record corresponding to the target text and the occurrence position of the candidate geographic keyword in the target text; calculating the text label probability of the candidate geographic keyword according to the feature vector corresponding to the candidate geographic keyword; and determining the geographic area label corresponding to the target text from all the candidate geographic keywords based on the text label probability corresponding to each candidate geographic keyword. With this method, the efficiency of text label identification can be greatly improved and the labor cost reduced.

Description

Text label identification method and device, electronic equipment and storage medium
Technical Field
The application belongs to the technical field of big data, and particularly relates to a text label identification method and device, electronic equipment and a storage medium.
Background
With the continuous development of network technology, any user or group can publish articles on the network, so the amount of text data published on the network has grown geometrically. When texts on a network are sorted and analyzed, corresponding labels often need to be added to the texts so that the texts can be classified quickly. In some application scenarios, in order to understand the situation in a certain geographic area, texts may be classified according to geographic labels, so how to accurately identify the geographic area described by the text content has become a problem that urgently needs to be solved.
In the existing text label identification technology, a plurality of different keywords related to geographic areas may appear in one text. As a result, when the geographic area described by the text has to be determined in order to assign the geographic label of the text, the label often has to be configured manually, which greatly reduces label configuration efficiency. In a scenario where the number of texts grows geometrically, a large amount of manpower is often consumed for text classification, which further increases labor cost.
Disclosure of Invention
The embodiments of the present application provide a text label identification method and device, electronic equipment and a storage medium, which can solve the following problem of the existing text label identification technology: when the geographic area described by the text content is determined, texts are often classified by manual configuration, so that the labor cost of text label configuration is greatly increased and the identification efficiency is low.
In a first aspect, an embodiment of the present application provides a text label identification method, including:
responding to a tag configuration request of a target text, and determining candidate geographic keywords contained in the target text through a preset entity identification model;
generating a feature vector corresponding to the candidate geographic keyword based on the text interaction record corresponding to the target text and the occurrence position of the candidate geographic keyword in the target text;
calculating the text label probability of the candidate geographic keywords according to the feature vectors corresponding to the candidate geographic keywords;
and determining the geographic area label corresponding to the target text from all the candidate geographic keywords based on the text label probability corresponding to each candidate geographic keyword.
In a possible implementation manner of the first aspect, the generating a feature vector corresponding to the candidate geographic keyword based on the text interaction record corresponding to the target text and the occurrence position of the candidate geographic keyword in the target text includes:
determining a text characteristic parameter group of the candidate keywords based on the appearance position;
determining geographical aliases of the candidate geographical keywords, and determining alias characteristic parameters based on the occurrence times of all the geographical aliases in the target text;
identifying the number of entities in the target text, which have incidence relation with the candidate geographic keywords, and determining entity characteristic parameters;
acquiring a word cloud set of the target text, and determining semantic feature parameters based on the inclusion relation between the candidate geographic keywords and the word cloud set;
identifying the release information of the target text, and determining a release characteristic parameter group based on a first association degree between the release information and the candidate keywords;
determining an interactive characteristic parameter group according to each text interaction record;
generating the feature vector based on the text feature parameter set, the alias feature parameter, the entity feature parameter, the semantic feature parameter, the release feature parameter set, and the interaction feature parameter set.
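As a minimal sketch of the last step above, the six parameter groups could be concatenated into one flat vector. The class name `KeywordFeatures`, its field names, the example values, and the fixed concatenation order are illustrative assumptions; the application does not prescribe a particular layout.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KeywordFeatures:
    """Hypothetical container for the six parameter groups of one candidate keyword."""
    text_group: List[float]         # derived from the occurrence positions
    alias: float                    # occurrences of all geographic aliases
    entity: float                   # number of associated entities
    semantic: float                 # 1.0 if the keyword is in the word cloud set, else 0.0
    release_group: List[float]      # first and second release characteristic values
    interaction_group: List[float]  # second and third degrees of association

    def to_vector(self) -> List[float]:
        # Concatenate in a fixed order so that every candidate keyword
        # yields a vector with the same layout.
        return [
            *self.text_group,
            self.alias,
            self.entity,
            self.semantic,
            *self.release_group,
            *self.interaction_group,
        ]


# Made-up values for illustration only.
features = KeywordFeatures(
    text_group=[3.0, 1.0],
    alias=2.0,
    entity=4.0,
    semantic=1.0,
    release_group=[0.8, 0.5],
    interaction_group=[0.6, 0.3],
)
vector = features.to_vector()
```

Keeping the order fixed matters because any downstream model would interpret each position of the vector as a specific feature.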
In a possible implementation manner of the first aspect, the identifying posting information of the target text and determining a posting feature parameter group based on a first degree of association between the posting information and the candidate keyword includes:
determining a publishing object of the target text, and calculating a first publishing characteristic value based on a first distance value between a first geographic position associated with the publishing object and a target geographic position corresponding to the candidate keyword;
determining a text author of the target text, and acquiring a plurality of published texts associated with the text author;
calculating a second release characteristic value based on a second distance value between a second geographic position corresponding to the existing geographic label of each released text and the target geographic position; wherein the second release characteristic value is specifically:
[Formula image not reproduced in the source text: Figure BDA0003486470030000021]
wherein $\mathrm{Publish}_2$ is the second release characteristic value; $\mathrm{Distance}(\mathrm{HisText}_i, \mathrm{AddressKey})$ is the second distance value between the second geographic position of the $i$-th published text and the target geographic position; $\mathrm{CurrentTime}$ is the release time of the target text; $\mathrm{time}_i$ is the release time of the $i$-th published text; $\mathrm{Num}([\mathrm{HisText}_i])$ is the total number of the published texts; and $\mathrm{Max}\{\mathrm{Distance}(\mathrm{HisText}_i, \mathrm{AddressKey})\}$ is a maximum value selecting function;
and determining the release characteristic parameter group according to the first release characteristic value and the second release characteristic value.
In a possible implementation manner of the first aspect, the text interaction record includes a text browsing record and a text comment record;
the determining the set of interactive feature parameters according to each text interaction record includes:
determining first user information of the browsing object of each text browsing record, and determining a second degree of association according to the first user information and the candidate keywords;
determining comment content of the comment object of each text comment record, and determining a third degree of association based on the comment content and the candidate keywords;
and generating the interactive feature parameter group according to the second degree of association and the third degree of association.
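The two degrees of association could be computed in many ways; the sketch below uses simple shares as one possibility. The `region` field on a user record, the substring match on comments, and all function names are assumptions made for illustration and are not defined by the application.

```python
from typing import Dict, List


def second_association(browsing_users: List[Dict[str, str]], keyword: str) -> float:
    """Share of browsing users whose (hypothetical) 'region' field equals the keyword."""
    if not browsing_users:
        return 0.0
    hits = sum(1 for user in browsing_users if user.get("region") == keyword)
    return hits / len(browsing_users)


def third_association(comments: List[str], keyword: str) -> float:
    """Share of comments whose content mentions the keyword."""
    if not comments:
        return 0.0
    hits = sum(1 for text in comments if keyword in text)
    return hits / len(comments)


def interaction_group(browsing_users: List[Dict[str, str]],
                      comments: List[str],
                      keyword: str) -> List[float]:
    """Interactive feature parameter group: [second, third] degree of association."""
    return [second_association(browsing_users, keyword),
            third_association(comments, keyword)]


users = [{"region": "Shenzhen"}, {"region": "Hunan"},
         {"region": "Shenzhen"}, {"region": "Beijing"}]
comments = ["Stay safe, Shenzhen!", "Thanks for the update."]
group = interaction_group(users, comments, "Shenzhen")
```

Here two of the four browsing users and one of the two comments relate to the keyword, so both shares are one half.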
In a possible implementation manner of the first aspect, the calculating, according to the feature vector corresponding to the candidate geographic keyword, a text label probability of the candidate geographic keyword includes:
determining a characteristic reference value corresponding to each characteristic value in the feature vector, and normalizing each characteristic value according to its corresponding characteristic reference value;
obtaining a normalized feature vector based on the normalized characteristic values;
importing the normalized feature vector into a preset prediction module to generate a global feature vector;
and importing the global feature vector into a preset trend evaluation module, and calculating to obtain the text label probability.
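The normalization step can be illustrated with a short sketch. Dividing each characteristic value by its reference value and clipping the result to the range 0 to 1 is an assumption chosen for simplicity; the application only states that each value is normalized according to its reference value.

```python
from typing import List


def normalize(features: List[float], references: List[float]) -> List[float]:
    """Divide each feature value by its reference value and clip to [0, 1]."""
    if len(features) != len(references):
        raise ValueError("each feature value needs exactly one reference value")
    normalized = []
    for value, reference in zip(features, references):
        # A zero reference would divide by zero, so map it to 0.0 (assumption).
        ratio = value / reference if reference else 0.0
        normalized.append(min(max(ratio, 0.0), 1.0))
    return normalized


normalized_vector = normalize([3.0, 12.0, 0.5], [6.0, 10.0, 1.0])
```

Normalizing first keeps features with large raw ranges, such as occurrence counts, from dominating features that are already ratios.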
In a possible implementation manner of the first aspect, the determining, by a preset entity recognition model, a candidate geographic keyword included in a target text in response to a tag configuration request of the target text includes:
responding to a tag configuration request of a target text, importing the target text into an entity identification model, and determining an entity keyword corresponding to the target text;
identifying entity keywords with co-occurrence relation in the target text, and determining the incidence relation among the entity keywords;
generating a knowledge graph based on the incidence relation among the entity keywords;
calculating a fourth degree of association between any two entity keywords; the fourth degree of association is:
$$\mathrm{Sim}(E_1, E_2) = \sum_{e_i \in \mathrm{Context}(E_1),\; e_j \in \mathrm{Context}(E_2)} \max\, \mathrm{sim}_{\mathrm{entity}}(e_i, e_j);$$
$$\mathrm{sim}_{\mathrm{entity}}(e_i, e_j) = \sum_{p \in \mathrm{Prop}(e_i) \cap \mathrm{Prop}(e_j)} \omega_p \, \mathrm{Similarity}_{\mathrm{type}(p)}(e_i[p], e_j[p])$$
wherein $\mathrm{Sim}(E_1, E_2)$ is the fourth degree of association between the two entity keywords; $\mathrm{Context}(E_1)$ is the set of associated entities that have the association relation with the entity keyword $E_1$ in the knowledge graph; $\mathrm{Context}(E_2)$ is the set of associated entities that have the association relation with the entity keyword $E_2$ in the knowledge graph; $e_i$ is the $i$-th associated entity in the association relation of the entity keyword $E_1$; $e_j$ is the $j$-th associated entity in the association relation of the entity keyword $E_2$; $\mathrm{Prop}(e_i)$ is the entity type of the $i$-th associated entity in the association relation of the entity keyword $E_1$; $\mathrm{Prop}(e_j)$ is the entity type of the $j$-th associated entity in the association relation of the entity keyword $E_2$; $\omega_p$ is the weight value corresponding to the entity type of the entity keyword; $\mathrm{Similarity}_{\mathrm{type}(p)}(e_i[p], e_j[p])$ is the matching degree function corresponding to the entity type; $e_i[p]$ is the parameter value of the entity type of the $i$-th associated entity in the association relation of the entity keyword $E_1$; and $e_j[p]$ is the parameter value of the entity type of the $j$-th associated entity in the association relation of the entity keyword $E_2$;
if the fourth degree of association is greater than a preset association threshold, identifying any two entity keywords as entity keywords with an alias relationship;
clustering entity keywords with alias relations into one geographic keyword.
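The association and clustering steps above can be sketched as follows. This sketch reads the first formula as "for each associated entity of E1, take its best match among the associated entities of E2", which is one plausible interpretation of the flattened notation. The exact-match similarity, the union-find clustering, and the data in the example are assumptions for illustration only.

```python
from itertools import combinations
from typing import Dict, List

Entity = Dict[str, str]  # property name -> property value


def sim_entity(ei: Entity, ej: Entity, weights: Dict[str, float]) -> float:
    """Weighted match over the properties shared by both entities."""
    shared = ei.keys() & ej.keys()
    # Assumption: exact match stands in for the per-type matching degree function.
    return sum(weights.get(p, 0.0) * (1.0 if ei[p] == ej[p] else 0.0) for p in shared)


def sim(context_1: List[Entity], context_2: List[Entity],
        weights: Dict[str, float]) -> float:
    """For each entity in context_1, add the score of its best match in context_2."""
    if not context_2:
        return 0.0
    return sum(max(sim_entity(ei, ej, weights) for ej in context_2)
               for ei in context_1)


def cluster_aliases(contexts: Dict[str, List[Entity]],
                    weights: Dict[str, float],
                    threshold: float) -> List[List[str]]:
    """Merge keywords whose association exceeds the threshold (union-find)."""
    parent = {keyword: keyword for keyword in contexts}

    def find(keyword: str) -> str:
        while parent[keyword] != keyword:
            parent[keyword] = parent[parent[keyword]]  # path halving
            keyword = parent[keyword]
        return keyword

    for a, b in combinations(contexts, 2):
        if sim(contexts[a], contexts[b], weights) > threshold:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for keyword in contexts:
        groups.setdefault(find(keyword), []).append(keyword)
    return sorted(sorted(group) for group in groups.values())


weights = {"type": 0.5, "name": 0.5}
contexts = {
    "Shenzhen": [{"type": "district", "name": "Futian"},
                 {"type": "district", "name": "Nanshan"}],
    "Shenzhen City": [{"type": "district", "name": "Futian"},
                      {"type": "district", "name": "Nanshan"}],
    "Hunan": [{"type": "city", "name": "Changsha"}],
}
clusters = cluster_aliases(contexts, weights, threshold=1.5)
```

Under this reading the measure is not symmetric in general, and the sketch only evaluates each pair in one direction; a stricter variant could require both directions to exceed the threshold.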
In a possible implementation manner of the first aspect, the determining, based on the text label probability corresponding to each candidate geographic keyword, a geographic area label corresponding to the target text from all the candidate geographic keywords includes:
selecting the candidate geographic keyword with the maximum text label probability as the geographic area label of the target text;
after the determining, based on the text tag probability corresponding to each candidate geographic keyword, a geographic area tag corresponding to the target text from all the candidate geographic keywords, the method further includes:
classifying all the target texts based on the geographic area labels to obtain a plurality of area text groups; the geographic region labels of the target text within each of the regional text groups are the same.
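A minimal sketch of these two steps, selecting the most probable keyword for each text and then grouping the texts by label, is shown below. The text identifiers and probability values are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def pick_label(probabilities: Dict[str, float]) -> str:
    """Return the candidate keyword with the highest text label probability."""
    return max(probabilities, key=probabilities.get)


def group_by_label(labels: Dict[str, str]) -> Dict[str, List[str]]:
    """Group text identifiers so that each group shares one geographic area label."""
    groups = defaultdict(list)
    for text_id, label in labels.items():
        groups[label].append(text_id)
    return dict(groups)


# Hypothetical per-text probabilities of each candidate geographic keyword.
scores = {
    "text-1": {"Shenzhen": 0.9, "Hunan": 0.2},
    "text-2": {"Hunan": 0.7, "Shenzhen": 0.1},
    "text-3": {"Shenzhen": 0.6},
}
labels = {text_id: pick_label(p) for text_id, p in scores.items()}
region_groups = group_by_label(labels)
```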
In a second aspect, an embodiment of the present application provides an apparatus for recognizing a text label, including:
the candidate geographic keyword determining unit is used for responding to a tag configuration request of a target text and determining candidate geographic keywords contained in the target text through a preset entity identification model;
the feature vector determining unit is used for generating a feature vector corresponding to the candidate geographic keyword based on the text interaction record corresponding to the target text and the occurrence position of the candidate geographic keyword in the target text;
the text label probability calculation unit is used for calculating the text label probability of the candidate geographic keywords according to the feature vectors corresponding to the candidate geographic keywords;
and the geographic area label identification unit is used for determining the geographic area label corresponding to the target text from all the candidate geographic keywords based on the text label probability corresponding to each candidate geographic keyword.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method according to any implementation of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method according to any implementation of the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer program product which, when run on an electronic device, causes the electronic device to perform the method according to any implementation of the first aspect.
Compared with the prior art, the embodiments of the present application have the following advantages. When the geographic area label corresponding to the target text needs to be identified, the target text is processed by the entity identification model to obtain the candidate geographic keywords contained in the target text, from which the geographic area label of the target text can be selected. To determine which candidate geographic keyword best represents the content of the target text, a feature vector is determined for each candidate geographic keyword according to its occurrence position in the target text and the text interaction record of the target text; the text label probability corresponding to each candidate geographic keyword is obtained based on the feature vector; and the geographic area label is then selected from the candidate geographic keywords, so that the geographic area label of the text is identified automatically. Compared with the existing text label identification technology, the method provided by the embodiments does not require geographic area labels to be configured manually, which greatly improves the efficiency of text label identification and reduces labor cost.
In addition, when determining the feature vector of each candidate geographic keyword, the embodiments of the present application not only consider the occurrence position of the candidate geographic keyword in the target text, which indicates how important the candidate geographic keyword is for representing the text content, but also use the interaction record of the target text to determine the relevance between the objects that interacted with the target text and the candidate geographic keyword. This enriches the information contained in the feature vector, improves the accuracy of the subsequent identification of the geographic area label, and in turn improves text management efficiency.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart illustrating an implementation of a text label identification method according to an embodiment of the present application;
Fig. 2 is a schematic diagram illustrating an implementation of S102 of a text label identification method according to an embodiment of the present application;
Fig. 3 is a schematic diagram illustrating an implementation of S1025 of a text label identification method according to an embodiment of the present application;
Fig. 4 is a schematic diagram illustrating an implementation of S1026 of a text label identification method according to an embodiment of the present application;
Fig. 5 is a schematic diagram illustrating an implementation of S103 of a text label identification method according to an embodiment of the present application;
Fig. 6 is a schematic diagram of a network for computing text label probabilities provided by an embodiment of the present application;
Fig. 7 is a schematic diagram illustrating an implementation of S101 of a text label identification method according to an embodiment of the present application;
Fig. 8 is a schematic diagram illustrating an implementation of a text label identification method according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of a text label identification apparatus according to an embodiment of the present application;
Fig. 10 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
The text label identification method provided by the embodiment of the application can be applied to electronic equipment such as a smart phone, a server, a tablet computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook and the like. The embodiment of the present application does not set any limit to the specific type of the electronic device. In particular, the electronic device may be a text server based on big data, where a large amount of text data is stored in the text server, and tags are configured for each text data, where the tags include, but are not limited to, a content tag, a character tag, and the like of the text data, and in particular, the tags include a geographic area tag for determining a geographic area specifically described by the content of the text data.
For example, take an epidemic analysis scenario. The network contains a large number of articles reporting on the epidemic situation in different regions. To learn about the epidemic situation in a certain region, or even in a certain district within it, it is necessary to determine the region or district described by the content of each article, add a corresponding geographic area label to the article, and classify the articles on the network based on the geographic area labels, so that the epidemic situation in different regions and districts can be understood conveniently. How accurately and efficiently the geographic area labels of a large amount of text data can be identified therefore directly affects the efficiency of epidemic management and control. Of course, classifying text data by adding geographic area labels can also be applied to public opinion analysis and to other fields in which the situation in different regions needs to be confirmed.
Referring to fig. 1, fig. 1 shows a flowchart of an implementation of a text label identification method provided in an embodiment of the present application, where the method includes the following steps:
In S101, in response to a tag configuration request of a target text, candidate geographic keywords included in the target text are determined through a preset entity identification model.
In this embodiment, the electronic device is configured with a text database that stores a large amount of text data. The text data includes the text data added with the label and the target text to be added with the label. The electronic device performs the operation of S101 when receiving a tag configuration request regarding a certain target text. The tag configuration request may be generated based on a user operation, or may be automatically generated.
In a possible implementation manner, the user terminal may send a tag configuration request carrying a target text to the electronic device. After receiving the tag configuration request, the electronic device may extract the target text carried in it and add a geographic area tag to the target text.
In a possible implementation manner, the tag configuration request may carry a text identifier. And the electronic equipment extracts a corresponding target text from a preset text database based on the text identification, and performs identification operation of the geographical area label.
In a possible implementation manner, the electronic device may download the text data from the internet at a preset period, and perform a tag identification operation on the text data after obtaining the text data. For example, the electronic device may be configured with corresponding text keywords, and text data containing the text keywords may be downloaded from the internet.
In this embodiment, the electronic device is configured with an entity identification model, which is specifically used to identify the entity keywords related to geographic locations contained in the target text, such as "Hunan" and "Shenzhen Technology Building", and to use the identified entity keywords as the candidate geographic keywords corresponding to the target text. The entity identification model may be obtained by training on a large amount of labeled data: the data are labeled with the BIO (Begin, Inside, Outside) tagging scheme, and the model is trained with a BiLSTM network combined with a CRF network. After training is completed, the resulting named entity recognition model is used for prediction, that is, it serves as the entity identification model.
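To illustrate how per-token tags turn into candidate geographic keywords, the sketch below decodes a tag sequence, assuming the common BIO (Begin, Inside, Outside) convention for named entity recognition. The tags are hard-coded here; in practice they would be predicted by the trained model. The `LOC` tag name, the example sentence, and joining tokens with a space (suitable for English tokens only) are assumptions.

```python
from typing import List, Optional, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Collect (entity text, entity type) pairs from a BIO tag sequence."""
    entities = []
    current: List[str] = []
    current_type: Optional[str] = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; close any entity that is still open.
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)
        else:
            # "O" or an inconsistent "I-" tag closes any open entity.
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities


tokens = ["Cases", "rose", "in", "Hunan", "and", "Shenzhen", "Technology", "Building"]
tags = ["O", "O", "O", "B-LOC", "O", "B-LOC", "I-LOC", "I-LOC"]
candidates = decode_bio(tokens, tags)
```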
In a possible implementation manner, the electronic device may store a plurality of training texts, each labeled with the geography-related entity keywords it contains. An existing recognition model is trained on the labeled training texts with a corresponding loss function. When the loss value of the loss function is detected to be less than or equal to a preset loss threshold, the recognition model is considered fully trained, and the entity identification model is obtained.
In S102, a feature vector corresponding to the candidate geographic keyword is generated based on the text interaction record corresponding to the target text and the occurrence position of the candidate geographic keyword in the target text.
In this embodiment, the electronic device may determine the feature vectors corresponding to the different candidate geographic keywords respectively. The feature vector is used to determine how well the candidate geographic keyword summarizes the content of the target text. The feature vector is related to the positions at which the candidate geographic keyword appears in the text, to text attributes such as the number of occurrences of the candidate geographic keyword counted from those positions, and to the interaction records corresponding to the text, such as information on the users who browsed the text, information on the users who commented on the text, and information on the author who wrote the text. The interaction operations include, but are not limited to, browsing and commenting. The interaction records can also indicate, to a certain extent, the audience that pays attention to the target text; the audience that pays closer attention to a text often consists of users who are strongly associated with the geographic area related to the text content. The geographic area represented by the content of the target text can therefore be determined to a certain extent from the text interaction records.
In a possible implementation manner, the electronic device may import the target text marked with the candidate geographic keyword into a preset conversion model of the position feature data, and calculate to obtain the position feature data determined based on the occurrence position; the electronic equipment can identify whether the candidate geographic keywords are contained in each text interaction record, calculate interaction characteristic data based on the text interaction records with the candidate geographic keywords, and obtain the characteristic vector according to the interaction characteristic data and the position characteristic data.
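As a rough illustration of position feature data, the sketch below derives three values from where a keyword occurs: the occurrence count, whether the keyword is in the title, and how early it first appears in the body. These particular features and the function name are assumptions; the application leaves the conversion model unspecified.

```python
from typing import List


def position_features(title: str, body: str, keyword: str) -> List[float]:
    """Return [occurrence count, in-title flag, earliness of first occurrence]."""
    if not keyword:
        raise ValueError("keyword must not be empty")
    positions = []
    start = body.find(keyword)
    while start != -1:
        positions.append(start)
        start = body.find(keyword, start + len(keyword))
    in_title = 1.0 if keyword in title else 0.0
    # Earliness is 1.0 at the very start of the body and approaches 0.0 at the end.
    earliness = 1.0 - positions[0] / len(body) if positions else 0.0
    return [float(len(positions)), in_title, earliness]


title = "Shenzhen reports new cases"
body = "Shenzhen reported new cases today. Travelers from Hunan arrived in Shenzhen."
shenzhen = position_features(title, body, "Shenzhen")
hunan = position_features(title, body, "Hunan")
```

In this example "Shenzhen" scores higher than "Hunan" on all three values, which matches the intuition that a keyword in the title and the opening sentence is more likely to describe the article's subject area.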
In S103, the text label probability of the candidate geographic keyword is calculated according to the feature vector corresponding to the candidate geographic keyword.
In this embodiment, after determining the feature vector of the candidate geographic keyword, the electronic device may import the feature vector into a preset tag probability identification network and calculate the text label probability corresponding to the candidate geographic keyword. The larger the text label probability, the greater the content relevance between the candidate geographic keyword and the target text; conversely, the smaller the text label probability, the smaller the content relevance between the candidate geographic keyword and the target text.
In one possible implementation, the electronic device is configured with an identification network for text labels. The identification network comprises two modules, namely a feature extraction module and a fully connected module. The feature extraction module may specifically be built from multiple convolution kernels: it convolves the feature vector with the convolution kernels to extract the feature values corresponding to the candidate keyword, and imports these feature values into the fully connected module to calculate the probability that the candidate keyword is the text label of the target text, that is, the text label probability. The identification network can be obtained by training on big data with artificial intelligence learning methods.
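The two-module structure can be sketched in plain Python. The ReLU activation, the max pooling over each kernel's output, the single sigmoid output unit, and the hand-picked untrained parameters are all assumptions made to keep the example small; they are not taken from the application.

```python
import math
from typing import List


def conv1d(vector: List[float], kernel: List[float]) -> List[float]:
    """Valid 1-D convolution (cross-correlation, as in most deep learning libraries)."""
    width = len(kernel)
    return [
        sum(vector[i + j] * kernel[j] for j in range(width))
        for i in range(len(vector) - width + 1)
    ]


def label_probability(vector: List[float],
                      kernels: List[List[float]],
                      dense_weights: List[float],
                      bias: float) -> float:
    """Feature extraction with several kernels, then one fully connected unit."""
    extracted = []
    for kernel in kernels:
        # ReLU followed by max pooling keeps one value per kernel.
        extracted.append(max(max(v, 0.0) for v in conv1d(vector, kernel)))
    logit = sum(w * x for w, x in zip(dense_weights, extracted)) + bias
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid maps the logit into (0, 1)


# Untrained, hand-picked parameters for illustration only.
kernels = [[1.0, -1.0], [0.5, 0.5]]
probability = label_probability([0.2, 0.8, 0.4], kernels, [1.0, 1.0], -0.5)
```

In a real system the kernels, dense weights, and bias would be learned from labeled texts rather than set by hand.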
It should be noted that the electronic device may determine a feature vector for each candidate geographic keyword, and respectively calculate a text label probability corresponding to each candidate geographic keyword.
In S104, based on the text label probability corresponding to each candidate geographic keyword, a geographic area label corresponding to the target text is determined from all the candidate geographic keywords.
In this embodiment, after calculating the text tag probability of each geographic keyword, the electronic device may identify the geographic area tag of the target text based on the text tag probability.
In a possible implementation manner, the electronic device may select a candidate geographic keyword with the highest text tag probability as the geographic area tag.
In a possible implementation manner, the electronic device may be configured with a corresponding probability threshold, and all the candidate geographic keywords having the text label probability greater than the probability threshold are taken as the geographic area labels.
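The threshold-based variant could look like the following sketch, which keeps every candidate whose probability exceeds the threshold and orders the result from most to least probable. The function name, the ordering, and the example probabilities are illustrative assumptions.

```python
from typing import Dict, List


def labels_above_threshold(probabilities: Dict[str, float], threshold: float) -> List[str]:
    """Return every candidate whose probability exceeds the threshold, best first."""
    kept = [keyword for keyword, p in probabilities.items() if p > threshold]
    return sorted(kept, key=lambda keyword: probabilities[keyword], reverse=True)


probabilities = {"Guangdong": 0.7, "Shenzhen": 0.9, "Hunan": 0.2}
selected = labels_above_threshold(probabilities, threshold=0.5)
```

Unlike picking the single most probable keyword, this variant can return several labels, or none when no candidate clears the threshold.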
In a possible implementation manner, the number of determined geographic area labels may be one or more. If there are multiple geographic area labels, they may be in a cascade relationship with one another. For example, the identified geographic area labels may be Guangdong, Shenzhen, Futian, and Lotus Flower Street: Lotus Flower Street is a street in Futian, Futian is a district of Shenzhen, and Shenzhen is a prefecture-level city of Guangdong Province. These labels are in a regional cascade relationship and correspond to different division granularities, so that text division operations at different granularities can be supported.
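The two selection rules described for S104 (highest probability, or all candidates above a threshold) can be sketched as follows. This is a minimal illustration only: the function name `select_region_labels` and the example probabilities are assumptions and do not come from the application.

```python
from typing import Dict, List, Optional


def select_region_labels(probabilities: Dict[str, float],
                         threshold: Optional[float] = None) -> List[str]:
    """Pick geographic area labels from candidate keyword probabilities.

    With no threshold, return only the most probable candidate;
    otherwise return every candidate whose probability exceeds the threshold.
    """
    if not probabilities:
        return []
    if threshold is None:
        return [max(probabilities, key=probabilities.get)]
    return [kw for kw, p in probabilities.items() if p > threshold]


# Hypothetical text label probabilities, for illustration only.
example = {"Guangdong": 0.62, "Shenzhen": 0.91, "Futian": 0.74, "Beijing": 0.08}
top_label = select_region_labels(example)
multi_labels = select_region_labels(example, threshold=0.6)
```

With the threshold rule, cascaded labels such as a province, a city, and a district can all be retained, which matches the multi-granularity case described above.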
As can be seen from the above, when a geographic area label corresponding to a target text needs to be identified, the text label identification method provided by the embodiment of the application identifies the target text through an entity identification model to obtain the candidate geographic keywords contained in the target text, from which the geographic area label of the target text can be selected. To determine which candidate geographic keyword best represents the content of the target text, a feature vector of each candidate geographic keyword is determined according to the occurrence position of that keyword in the target text and the text interaction record of the target text; the text label probability corresponding to each candidate geographic keyword is obtained based on the feature vector, and the geographic area label is then selected from the candidate geographic keywords, thereby automatically identifying the geographic area label of the text. Compared with existing text label identification technology, the method provided by this embodiment does not require geographic area labels to be configured manually, which improves the efficiency of text label identification and reduces labor cost.
On the other hand, when determining the feature vector of each candidate geographic keyword, the embodiment of the application not only considers the occurrence position of the candidate geographic keyword in the target text, determines the importance degree of the candidate geographic keyword on the text content representation through the occurrence position, but also determines the relevance between the object interacted with the target text and the candidate geographic keyword through the interaction record of the target text, thereby improving the richness of the information contained in the feature vector, further improving the accuracy of subsequent identification of the geographic area tag, and further improving the text management efficiency.
Fig. 2 shows a flowchart of a specific implementation of S102 of the text label identification method according to the second embodiment of the present invention. Referring to fig. 2, with respect to the embodiment described in fig. 1, in the text label identification method provided in this embodiment, S102 includes S1021 to S1027, which are described in detail as follows:
Further, the generating a feature vector corresponding to the candidate geographic keyword based on the text interaction record corresponding to the target text and the occurrence position of the candidate geographic keyword in the target text includes:
In S1021, a text feature parameter group of the candidate geographic keyword is determined based on the occurrence position.
In this embodiment, a text may be divided into different areas, such as a title area, a subtitle area, an abstract area, a body area, and a reference area, and different areas summarize the text content to different degrees. For example, the title area summarizes the text content to a high degree, whereas the body area contains a large number of words and expands on the content, so it summarizes the text content to a lower degree. Accordingly, if a geographic keyword appears in the title area, the information density of the text content it represents is high; if a geographic keyword appears in the body area, the information density of the text content it represents is low. The corresponding text feature parameter group can therefore be generated according to the occurrence position of the candidate geographic keyword in the target text. The text feature parameter group may contain one feature parameter or a plurality of feature parameters, as determined by actual conditions. For example, if a candidate geographic keyword appears multiple times in the target text, the number of parameters in the text feature parameter group may equal the number of occurrences, that is, each occurrence position corresponds to one text feature parameter, and all the text feature parameters constitute the text feature parameter group of the candidate geographic keyword.
In a possible implementation manner, the text feature parameter set includes the following four feature parameter values, which are respectively:
1. Determine the number of occurrences of the candidate geographic keyword in the target text, and determine a first text feature parameter based on the number of occurrences.
In this embodiment, a candidate geographic keyword may appear in the text multiple times. The electronic device may count the number of occurrences of the candidate geographic keyword in the target text and calculate the first text feature parameter based on that number. It should be noted that, if the candidate geographic keyword has one or more aliases, the occurrences of the aliases may also be counted toward the number of occurrences of the candidate geographic keyword when obtaining the first text feature parameter.
2. Identify whether the candidate geographic keyword appears in the title area of the target text, and determine a second text feature parameter.
3. Identify whether the candidate geographic keyword appears in the first paragraph or the last paragraph of the body area of the target text, and determine a third text feature parameter.
4. Identify whether the candidate geographic keyword appears in the abstract, and determine a fourth text feature parameter.
In this embodiment, three areas of higher importance can be identified according to their importance to the text content: the title area, the abstract area, and the first or last paragraph of the body area. Because the words in each of these areas summarize the text content to a high degree, a candidate geographic keyword that appears in such an area is important to the text content, so whether the candidate geographic keyword appears in each area can be used to determine the corresponding text feature parameter. If the candidate geographic keyword appears in the area, the corresponding text feature parameter takes a first bit value; otherwise, if it does not appear in the area, the corresponding text feature parameter takes a second bit value.
For example, if a candidate geographic keyword appears in the title area, its second text feature parameter is 1 (i.e., the first bit value); if the candidate geographic keyword does not appear in the abstract area, its fourth text feature parameter is 0 (i.e., the second bit value); and so on.
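The four text feature parameters listed above can be sketched as follows. The `Document` container, the function name, and the sample text are hypothetical; plain substring matching is an assumption made for brevity, and aliases are counted only toward the first parameter, as the description of that parameter states.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Document:
    # Hypothetical container; the application only names the areas.
    title: str
    abstract: str
    paragraphs: List[str]  # body area, in reading order


def text_feature_group(doc: Document, keyword: str,
                       aliases: Sequence[str] = ()) -> List[int]:
    """Return [occurrence count, in title, in first/last paragraph, in abstract]."""
    full_text = " ".join([doc.title, doc.abstract, *doc.paragraphs])
    # 1. Occurrence count; alias occurrences are added to the keyword's count.
    #    Assumes aliases do not contain the keyword as a substring.
    count = sum(full_text.count(name) for name in (keyword, *aliases))
    # 2. Title area: 1 (first bit value) if present, else 0 (second bit value).
    in_title = int(keyword in doc.title)
    # 3. First or last paragraph of the body area.
    edge_paragraphs = doc.paragraphs[:1] + doc.paragraphs[-1:]
    in_edge = int(any(keyword in p for p in edge_paragraphs))
    # 4. Abstract area.
    in_abstract = int(keyword in doc.abstract)
    return [count, in_title, in_edge, in_abstract]


doc = Document(
    title="Shenzhen opens new metro line",
    abstract="A new line opened this week.",
    paragraphs=[
        "The line runs through Futian.",
        "Officials in Pengcheng praised the project.",
        "Shenzhen plans two more lines.",
    ],
)
features = text_feature_group(doc, "Shenzhen", aliases=["Pengcheng"])
```

Here the keyword occurs twice and its alias once, it appears in the title and in the last paragraph, and it is absent from the abstract.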
In S1022, geographic aliases of the candidate geographic keywords are determined, and alias feature parameters are determined based on the number of occurrences of all the geographic aliases in the target text.
In this embodiment, the electronic device may obtain the alias of each candidate geographic keyword according to a preset knowledge graph or a pre-stored alias dictionary, determine whether the alias of each candidate geographic keyword appears in the target text and the number of occurrences of the alias, obtain two feature parameters through analysis, and package the two feature parameters to obtain the alias feature parameters.
In S1023, the number of entities in the target text that have an association relationship with the candidate geographic keyword is identified, and an entity feature parameter is determined.
In this embodiment, the electronic device may count whether place names associated with the candidate geographic keyword exist in the target text. Such a place name need not be an identified candidate geographic keyword itself; it may be an entity highly relevant to the place. For example, Wuhan is associated with the Yellow Crane Tower, the Wuhan Yangtze River Bridge, Wuhan University, Tanhualin, and the like. The entity feature parameter is obtained according to the number of occurrences of the entities associated with the candidate geographic keyword.
In S1024, a word cloud set of the target text is obtained, and semantic feature parameters are determined based on the inclusion relationship between the candidate geographic keywords and the word cloud set.
In this embodiment, the word cloud set is a keyword set that can represent the main content of the target text and is extracted after semantic analysis is performed on the target text. If the candidate geographic keyword appears in the word cloud, the candidate geographic keyword has higher representativeness to the text content of the target text, so that the corresponding semantic feature parameter can be determined based on whether the word cloud set contains the candidate geographic keyword.
In S1025, the posting information of the target text is identified, and a posting feature parameter group is determined based on a first degree of association between the posting information and the candidate keyword.
In this embodiment, the electronic device may obtain the publishing information corresponding to the target text, such as the publisher and the publishing location. Publishing information is often strongly correlated with the geographic location of the content described in the text; for example, a publishing object such as "Guangzhou Daily" often reports local Guangzhou news. The electronic device may therefore indirectly infer the correlation between the candidate geographic keyword and the geographic area label by determining the first degree of association between the candidate geographic keyword and the publishing information, and determine the publishing feature parameter group based on the first degree of association.
In S1026, an interaction feature parameter group is determined according to each text interaction record.
In this embodiment, the electronic device may generate a corresponding set of interaction feature parameters according to the association degrees between all text interaction records of the target text and the candidate geographic keywords. For example, whether the interactive content of the text interaction record contains the candidate geographic keyword or the entity associated with the candidate geographic keyword is judged, so as to obtain the corresponding interactive characteristic parameter.
In S1027, the feature vector is generated based on the text feature parameter group, the alias feature parameters, the entity feature parameter, the semantic feature parameters, the publishing feature parameter group, and the interaction feature parameter group.
In this embodiment, the electronic device may encapsulate the plurality of calculated feature parameters, so as to generate a feature vector about the candidate geographic keyword.
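The encapsulation in S1027 can be pictured as concatenating the six parameter groups in a fixed order. The sketch below assumes a simple flat list layout; the group names, their order, and the sample values are illustrative assumptions and are not specified by the application.

```python
from typing import Dict, List, Sequence

# Assumed fixed order so that every keyword's vector has the same layout.
GROUP_ORDER = ("text", "alias", "entity", "semantic", "publishing", "interaction")


def build_feature_vector(groups: Dict[str, Sequence[float]]) -> List[float]:
    """Concatenate the feature parameter groups into one flat feature vector."""
    vector: List[float] = []
    for name in GROUP_ORDER:
        if name not in groups:
            raise KeyError(f"missing feature parameter group: {name}")
        vector.extend(float(v) for v in groups[name])
    return vector


# Hypothetical parameter values for one candidate geographic keyword.
vector = build_feature_vector({
    "text": [3, 1, 1, 0],
    "alias": [1, 1],
    "entity": [2],
    "semantic": [1],
    "publishing": [0.8, 0.5],
    "interaction": [0.6, 0.4],
})
```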
In the embodiment of the application, the characteristic parameters of the candidate geographic keywords are determined through multiple dimensions, so that the characteristic vectors of the candidate geographic keywords are generated, the association degree of the candidate geographic keywords and the text content can be judged from the multiple dimensions, and the identification accuracy of subsequent geographic area labels is greatly improved.
Fig. 3 shows a flowchart of a specific implementation of S1025 of the text label identification method according to a third embodiment of the present invention. Referring to fig. 3, with respect to the embodiment described in fig. 2, in the text label identification method provided in this embodiment, S1025 includes S301 to S304, which are described in detail as follows:
In S301, a publishing object of the target text is determined, and a first publishing feature value is calculated based on a first distance value between a first geographic location associated with the publishing object and a target geographic location corresponding to the candidate geographic keyword.
In this embodiment, the publishing information includes a publishing object and a text author. The publishing object may be an enterprise, a group, an individual, or the like, such as a Shenzhen newspaper, a Guangzhou newspaper, or the official account of the Guangzhou public security authority. If the text author is also the publishing object of the target text, the publishing object and the text author may be the same. Each publishing object may be associated with a corresponding registration location, that is, the first geographic location, and the electronic device may obtain the first publishing feature value according to the first distance value between the target geographic location corresponding to the candidate geographic keyword and the first geographic location associated with the publishing object. The smaller the first distance value, the larger the first publishing feature value.
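The application states only that a smaller first distance value yields a larger first publishing feature value; it does not give the mapping. The sketch below assumes great-circle (haversine) distance and the decreasing mapping 1 / (1 + d / scale). Both choices, the `scale_km` parameter, and the approximate coordinates are assumptions made for illustration.

```python
import math
from typing import Tuple

LatLon = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(a: LatLon, b: LatLon) -> float:
    """Great-circle distance between two points, using a 6371 km Earth radius."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def first_publishing_value(publisher_loc: LatLon, keyword_loc: LatLon,
                           scale_km: float = 100.0) -> float:
    """Assumed mapping: 1.0 at zero distance, decreasing as distance grows."""
    return 1.0 / (1.0 + haversine_km(publisher_loc, keyword_loc) / scale_km)


# Approximate city-centre coordinates, for illustration only.
GUANGZHOU = (23.13, 113.26)
SHENZHEN = (22.54, 114.06)
BEIJING = (39.90, 116.40)

near = first_publishing_value(GUANGZHOU, SHENZHEN)
far = first_publishing_value(GUANGZHOU, BEIJING)
```

Any other monotonically decreasing mapping would satisfy the stated property equally well.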
In S302, a text author of the target text is determined, and a plurality of published texts associated with the text author are obtained.
In this embodiment, the electronic device may obtain all texts that have been published by a text author, that is, published texts, according to the text author associated with the target text.
In S303, a second publishing feature value is calculated based on a second distance value between the target geographic location and a second geographic location corresponding to the existing geographic label of each published text; wherein the second publishing feature value is specifically:
Figure BDA0003486470030000111
wherein $\mathrm{Publish}_2$ is the second publishing feature value; $\mathrm{Distance}(\mathrm{HisText}_i, \mathrm{AddressKey})$ is the second distance value between the second geographic location of the $i$-th published text and the target geographic location; $\mathrm{CurrentTime}$ is the publication time of the target text; $\mathrm{Time}_i$ is the publication time of the $i$-th published text; $\mathrm{Num}([\mathrm{HisText}_i])$ is the total number of the published texts; and $\mathrm{Max}\{\mathrm{Distance}(\mathrm{HisText}_i, \mathrm{AddressKey})\}$ is the maximum value selection function.
In this embodiment, each published text is a text for which a geographic area label has already been configured, so the corresponding second geographic location can be determined by obtaining the existing geographic label of each published text. The distance value between the second geographic location and the target geographic location corresponding to the candidate geographic keyword, that is, the second distance value, is then calculated, and the second publishing feature value is calculated from it. The electronic device may determine a weight for each published text based on the difference between the publication time of that published text and the publication time of the target text: the closer the two publication times, the higher the weight.
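Because the formula itself is only available as an image, its exact form cannot be reproduced here. The following sketch is therefore an assumed form, not the application's formula: it combines the quantities named in the definitions (the per-text distance, the maximum distance, the time difference, and the number of published texts) so that nearer and more recent published texts contribute more. The weighting 1 / (1 + time difference) and the time unit (days) are assumptions.

```python
from typing import List, Tuple


def second_publishing_value(history: List[Tuple[float, float]],
                            current_time: float) -> float:
    """Assumed form of the second publishing feature value.

    history: (distance_km, publish_time) for each published text of the author,
             where distance_km is the second distance value and times are in days.
    """
    if not history:
        return 0.0
    max_distance = max(d for d, _ in history)  # Max{Distance(...)}
    total = 0.0
    for distance, publish_time in history:
        # Closeness in [0, 1]: 1 when the text's label coincides with the target.
        closeness = 1.0 - distance / max_distance if max_distance > 0 else 1.0
        # Assumed time weight: texts published closer to CurrentTime count more.
        weight = 1.0 / (1.0 + (current_time - publish_time))
        total += weight * closeness
    return total / len(history)  # divide by Num([HisText_i])


# Hypothetical history: (distance in km, publication day); target published on day 100.
value = second_publishing_value([(0.0, 99.0), (50.0, 90.0), (100.0, 10.0)], 100.0)
recent = second_publishing_value([(0.0, 99.0), (100.0, 99.0)], 100.0)
old = second_publishing_value([(0.0, 10.0), (100.0, 10.0)], 100.0)
```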
In S304, the publishing feature parameter group is determined according to the first publishing feature value and the second publishing feature value.
In this embodiment, the electronic device encapsulates the first publishing feature value and the second publishing feature value to obtain the publishing feature parameter group of the candidate geographic keyword.
In the embodiment of the application, the first publishing feature value is determined from the publishing object and the second publishing feature value is determined from the texts already published by the text author, which makes the publishing features more comprehensive and improves the accuracy of the subsequent calculation of the text label probability.
Fig. 4 shows a flowchart of a specific implementation of S1026 of the text label identification method according to a fourth embodiment of the present invention. Referring to fig. 4, with respect to the embodiment described in fig. 2, in the text label identification method provided in this embodiment, S1026 includes S401 to S403, which are described in detail as follows:
Further, the text interaction records comprise text browsing records and text comment records; the determining the interaction feature parameter group according to each text interaction record includes:
In S401, first user information of the browsing object of each text browsing record is determined, and a second degree of association is determined according to the first user information and the candidate geographic keyword.
In S402, comment contents of comment objects of the respective text comment records are determined, and a third degree of association is determined based on the comment contents and the candidate keywords.
In S403, the set of interactive feature parameters is generated according to the second degree of association and the third degree of association.
In this embodiment, the electronic device may obtain first user information of a user viewing the target text, extract a network address from the first user information, determine a location of the user browsing the target text through the network address, and determine the second association degree by calculating a distance between the location of the user and a target geographic position corresponding to the candidate geographic keyword.
In a possible implementation manner, after determining the first user information of each text browsing record, the electronic device may count the proportion of users browsing the target text at each location, select the N locations with the most viewers as the representative browsing locations, and calculate the second degree of association based on the distance values between the representative browsing locations and the target geographic location corresponding to the candidate geographic keyword.
In this embodiment, similar to the determination of the second degree of association, the electronic device may determine, according to the text comment records, the location of each user who commented on the target text, calculate the distance between the location of the commenting user and the target geographic location corresponding to the candidate geographic keyword, and determine the third degree of association. For details, reference may be made to the description of the second degree of association, which is not repeated here.
In this embodiment, the electronic device may encapsulate the second association degree and the third association degree, so as to obtain the set of interaction feature parameters related to the interaction behavior.
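The steps S401 to S403 can be sketched as follows: take the N most common viewer (or commenter) locations as representatives, turn their distances to the target geographic location into an association degree, and package the two degrees. The distance-to-score mapping 1 / (1 + d / scale), the averaging, and all names and sample values are assumptions; the application does not specify them.

```python
from collections import Counter
from typing import Dict, List, Sequence


def representative_locations(user_locations: Sequence[str], n: int) -> List[str]:
    """Return the n locations with the most users (browsing representative locations)."""
    return [loc for loc, _ in Counter(user_locations).most_common(n)]


def association_degree(locations: Sequence[str],
                       distance_to_target_km: Dict[str, float],
                       scale_km: float = 100.0) -> float:
    """Assumed score: mean of 1 / (1 + d / scale) over the given locations."""
    if not locations:
        return 0.0
    scores = [1.0 / (1.0 + distance_to_target_km[loc] / scale_km) for loc in locations]
    return sum(scores) / len(scores)


def interaction_feature_group(browse_locations: Sequence[str],
                              comment_locations: Sequence[str],
                              distance_to_target_km: Dict[str, float],
                              n: int = 2) -> List[float]:
    """Package the second and third degrees of association."""
    second = association_degree(
        representative_locations(browse_locations, n), distance_to_target_km)
    third = association_degree(
        representative_locations(comment_locations, n), distance_to_target_km)
    return [second, third]


# Hypothetical locations and distances to the candidate keyword's location.
distances = {"Shenzhen": 0.0, "Guangzhou": 100.0, "Beijing": 1900.0}
browsers = ["Shenzhen"] * 5 + ["Guangzhou"] * 3 + ["Beijing"]
commenters = ["Beijing", "Beijing", "Shenzhen"]
group = interaction_feature_group(browsers, commenters, distances)
```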
In the embodiment of the application, the association degree between the geographic position of the interactive object and the candidate geographic key words is determined, the interactive feature parameter group is determined, whether the candidate geographic key words are related to the geographic area of the target text or not can be determined through the interactive object, the information richness of the feature vector is further improved, and the accuracy of the probability of the subsequent text label is further improved.
Fig. 5 shows a flowchart of a specific implementation of S103 of the text label identification method according to a fifth embodiment of the present invention. Referring to fig. 5, with respect to the embodiment described in fig. 1, in the text label identification method provided in this embodiment, S103 includes S1031 to S1034, which are described in detail as follows:
In S1031, a feature reference value corresponding to each feature value in the feature vector is determined, and each feature value is normalized according to its feature reference value.
In S1032, a normalized feature vector is obtained based on the normalized feature values.
In S1033, the normalized feature vector is imported into a preset prediction module, and a global feature vector is generated.
In S1034, the global feature vector is imported into a preset trend evaluation module, and the text label probability is calculated.
In this embodiment, the electronic device may perform normalization processing on each feature value in the feature vector, so as to eliminate the influence of different feature dimensions on the result, and the specific normalization rule may be determined according to the physical characteristic of the corresponding feature value, that is, a feature reference value of each feature value in the feature vector is determined, for example, the feature vector may be normalized by using a softmax function, so as to obtain a normalized feature vector.
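Two normalization options consistent with this paragraph can be sketched as follows: scaling each feature value by its own feature reference value, and applying the softmax function to the whole vector. How the reference values are chosen is not specified in the application; the values below are hypothetical.

```python
import math
from typing import List, Sequence


def normalize_by_reference(features: Sequence[float],
                           references: Sequence[float]) -> List[float]:
    """Divide each feature value by its feature reference value."""
    if len(features) != len(references):
        raise ValueError("features and references must have the same length")
    # A zero reference is treated as 'no scale available' (an assumption).
    return [f / r if r else 0.0 for f, r in zip(features, references)]


def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax; the outputs are positive and sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw features (count, flag, distance in km) and reference values.
scaled = normalize_by_reference([3.0, 1.0, 50.0], [10.0, 1.0, 100.0])
probs = softmax([3.0, 1.0, 50.0])
```

Note that softmax applied to features on very different scales is dominated by the largest value, which is one reason per-feature reference scaling may be preferred before any joint normalization.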
In a possible implementation manner, before inputting the feature vector into the network formed by the prediction module and the trend evaluation module, the electronic device may train the two modules. Specifically, the electronic device imports training data into the network and generates a decision tree corresponding to the training data through feature extraction and data compression of the training data; the prediction module outputs a vector based on the decision tree, and this output is fed to the trend evaluation module, which calculates the corresponding global vector. In addition to the training data, corresponding verification data is configured, and the prediction module outputs a verification score for the verification data. The parameters of each module in the framework are adjusted using the verification score and the global vector, yielding the trained network, that is, the trained prediction module and trend evaluation module. As shown in fig. 6, the network includes a prediction module and a trend evaluation module; it can be trained with training data and, once trained, converts the feature vector into the text label probability.
In the embodiment of the application, before the text label probability is calculated, normalization processing is performed on each feature value in the feature vector, so that the influence caused by dimension can be eliminated, and the accuracy of subsequent calculation is further improved.
Fig. 7 shows a flowchart of a specific implementation of S101 of the text label identification method according to a sixth embodiment of the present invention. Referring to fig. 7, with respect to the embodiment described in any one of fig. 1 to 5, in the text label identification method provided by this embodiment, S101 includes S1011 to S1016, which are described in detail as follows:
In S1011, in response to the label configuration request of the target text, the target text is imported into the entity identification model, and the entity keywords corresponding to the target text are determined.
In S1012, entity keywords having a co-occurrence relationship in the target text are identified, and an association relationship between the entity keywords is determined.
In S1013, a knowledge graph is generated based on the association relationship between the entity keywords.
In S1014, a fourth degree of association between any two entity keywords is calculated; the fourth degree of association is:
$$\mathrm{Sim}(E_1, E_2) = \sum_{e_i \in \mathrm{Context}(E_1),\; e_j \in \mathrm{Context}(E_2)} \max\, \mathrm{sim}_{\mathrm{entity}}(e_i, e_j);$$
$$\mathrm{sim}_{\mathrm{entity}}(e_i, e_j) = \sum_{p \in \mathrm{Prop}(e_i) \cap \mathrm{Prop}(e_j)} \omega_p \, \mathrm{Similarity}_{\mathrm{type}(p)}(e_i[p], e_j[p])$$
wherein $\mathrm{Sim}(E_1, E_2)$ is the fourth degree of association between the two entity keywords; $\mathrm{Context}(E_1)$ is the set of associated entities having the association relationship with the entity keyword $E_1$ in the knowledge graph; $\mathrm{Context}(E_2)$ is the set of associated entities having the association relationship with the entity keyword $E_2$ in the knowledge graph; $e_i$ is the $i$-th associated entity in the association relationship of the entity keyword $E_1$; $e_j$ is the $j$-th associated entity in the association relationship of the entity keyword $E_2$; $\mathrm{Prop}(e_i)$ is the entity type of the $i$-th associated entity in the association relationship of the entity keyword $E_1$; $\mathrm{Prop}(e_j)$ is the entity type of the $j$-th associated entity in the association relationship of the entity keyword $E_2$; $\omega_p$ is the weight value corresponding to the entity type; $\mathrm{Similarity}_{\mathrm{type}(p)}(e_i[p], e_j[p])$ is the matching degree function corresponding to the entity type; $e_i[p]$ is the parameter value of the entity type of the $i$-th associated entity in the association relationship of the entity keyword $E_1$; and $e_j[p]$ is the parameter value of the entity type of the $j$-th associated entity in the association relationship of the entity keyword $E_2$.
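The two formulas for the fourth degree of association can be sketched as follows. The sketch makes three assumptions that go beyond the text: each associated entity is represented as a dictionary whose keys play the role of Prop(e); the maximum is taken over the associated entities of E2 for each associated entity of E1; and the default matching degree function is exact equality. The weights and sample entities are hypothetical.

```python
from typing import Callable, Dict, List

Entity = Dict[str, str]                 # key p -> parameter value e[p]
Matcher = Callable[[str, str], float]   # matching degree function for one key


def exact_match(a: str, b: str) -> float:
    """Assumed default matching degree: 1.0 if equal, else 0.0."""
    return 1.0 if a == b else 0.0


def sim_entity(ei: Entity, ej: Entity, weights: Dict[str, float],
               matchers: Dict[str, Matcher]) -> float:
    """Weighted sum of matching degrees over the keys shared by both entities."""
    shared = set(ei) & set(ej)
    return sum(weights.get(p, 1.0) * matchers.get(p, exact_match)(ei[p], ej[p])
               for p in shared)


def sim(context1: List[Entity], context2: List[Entity],
        weights: Dict[str, float], matchers: Dict[str, Matcher]) -> float:
    """For each associated entity of E1, add its best match among those of E2."""
    return sum(
        max((sim_entity(ei, ej, weights, matchers) for ej in context2), default=0.0)
        for ei in context1
    )


# Hypothetical associated entities of two entity keywords.
context_e1 = [
    {"type": "landmark", "name": "Yellow Crane Tower"},
    {"type": "university", "name": "Wuhan University"},
]
context_e2 = [{"type": "landmark", "name": "Yellow Crane Tower"}]
score = sim(context_e1, context_e2, {"type": 0.3, "name": 0.7}, {})
```

Two keywords that share most of their associated entities obtain a high score, which is the property the alias decision in S1015 relies on.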
In S1015, if the fourth degree of association is greater than a preset association threshold, the two entity keywords are identified as entity keywords having an alias relationship.
In S1016, the entity keywords having the alias relationship are clustered into one candidate geographic keyword.
In this embodiment, the electronic device determines, through the entity identification model, the entity keywords contained in the target text, and adds a corresponding node for each entity keyword in a preset knowledge graph. If two entity keywords are in the same sentence or the same paragraph, they are identified as having a co-occurrence relationship; alternatively, two entity keywords are determined to have a co-occurrence relationship if the connecting word between them is a preset valid connecting word. For two entity keywords having a co-occurrence relationship, the corresponding nodes are connected in the knowledge graph, that is, the two entity keywords have an association relationship. In this way the isolated nodes are connected, and a knowledge graph based on all the entity keywords is generated. The electronic device may calculate the fourth degree of association between entity keywords based on the knowledge graph; if the fourth degree of association between two entity keywords is greater than a preset association threshold, an alias relationship is identified between the two entity keywords, and the two entity keywords are clustered into one keyword that serves as a candidate geographic keyword.
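The graph construction and clustering described above can be sketched as follows. The sketch assumes sentence-level co-occurrence detected by substring matching and uses a union-find structure to merge keywords whose association degree exceeds the threshold; the similarity table stands in for the fourth degree of association and, like the sample sentences, is hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Set, Tuple


def cooccurrence_edges(sentences: Sequence[str],
                       entities: Sequence[str]) -> Set[Tuple[str, str]]:
    """Connect two entity keywords when they appear in the same sentence."""
    edges: Set[Tuple[str, str]] = set()
    for sentence in sentences:
        present = sorted(e for e in entities if e in sentence)
        edges.update(combinations(present, 2))
    return edges


def cluster_aliases(entities: Sequence[str],
                    similarity: Callable[[str, str], float],
                    threshold: float) -> List[List[str]]:
    """Merge keywords whose association degree exceeds the threshold (union-find)."""
    parent: Dict[str, str] = {e: e for e in entities}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(entities, 2):
        if similarity(a, b) > threshold:
            parent[find(a)] = find(b)

    clusters: Dict[str, List[str]] = {}
    for e in entities:
        clusters.setdefault(find(e), []).append(e)
    return list(clusters.values())


entities = ["Shenzhen", "Pengcheng", "Guangzhou"]
edges = cooccurrence_edges(
    ["Shenzhen, also called Pengcheng, borders Hong Kong.",
     "Guangzhou is the provincial capital."],
    entities,
)
# Hypothetical association degrees standing in for Sim(E1, E2).
table = {frozenset({"Shenzhen", "Pengcheng"}): 0.9}
clusters = cluster_aliases(
    entities, lambda a, b: table.get(frozenset({a, b}), 0.0), threshold=0.5)
```

Each resulting cluster corresponds to one candidate geographic keyword, so occurrences of an alias are attributed to the same candidate rather than split between two.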
In the embodiment of the application, after the entity keywords related to the geography are identified, the alias identification is carried out, and the entity keywords with the alias relation are clustered to obtain the candidate geography keywords, so that the situation that the probability of text labels is calculated by referring to different keywords of the same object can be avoided, the importance degree of the keywords is diluted, and the identification accuracy of the follow-up identification of the geography area labels is improved.
Fig. 8 is a flowchart illustrating a specific implementation of a text label identification method according to a seventh embodiment of the present invention. Referring to fig. 8, with respect to the embodiment described in any one of fig. 1 to 5, in the text label identification method provided in this embodiment, S104 includes S801, and the method further includes S802 after S104, which are described in detail as follows:
In S801, the candidate geographic keyword with the highest text label probability is selected as the geographic area label of the target text.
In S802, classifying all the target texts based on the geographic area tags to obtain a plurality of area text groups; the geographic region labels of the target text within each of the regional text groups are the same.
In this embodiment, the electronic device may select the candidate geographic keyword with the highest text label probability as the geographic area label of the target text. It may then classify all the target texts based on their geographic area labels and place all target texts with the same geographic area label into one regional text group, so that the user can learn about the situation in a given geographic area, such as the epidemic situation, public opinion, and current hot events, through the regional text group.
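The grouping in S802 amounts to bucketing texts by their geographic area label. A minimal sketch, with hypothetical text identifiers and labels:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_region(labelled_texts: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each geographic area label to the identifiers of the texts carrying it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for text_id, label in labelled_texts:
        groups[label].append(text_id)
    return dict(groups)


# Hypothetical (text identifier, geographic area label) pairs.
region_groups = group_by_region([
    ("text-1", "Shenzhen"),
    ("text-2", "Guangzhou"),
    ("text-3", "Shenzhen"),
])
```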
In the embodiment of the application, the texts are classified through the geographic area labels, so that a user can conveniently know the condition of a specific area, and the text searching efficiency is improved.
Fig. 9 is a block diagram illustrating the structure of a text label identification apparatus according to an embodiment of the present invention, where the electronic device includes units for executing the steps in the embodiment corresponding to fig. 1. For details, refer to fig. 1 and the related description of the embodiment corresponding to fig. 1. For convenience of explanation, only the portions related to the present embodiment are shown.
Referring to fig. 9, the text label identification apparatus includes:
a candidate geographic keyword determining unit 91, configured to determine, in response to a label configuration request of a target text, the candidate geographic keywords contained in the target text through a preset entity identification model;
a feature vector determining unit 92, configured to generate a feature vector corresponding to the candidate geographic keyword based on a text interaction record corresponding to the target text and an occurrence position of the candidate geographic keyword in the target text;
a text label probability calculating unit 93, configured to calculate a text label probability of the candidate geographic keyword according to the feature vector corresponding to the candidate geographic keyword;
a geographic area tag identification unit 94, configured to determine, based on the text tag probability corresponding to each candidate geographic keyword, a geographic area tag corresponding to the target text from all the candidate geographic keywords.
Optionally, the feature vector determining unit 92 includes:
a text feature parameter set determining unit, configured to determine a text feature parameter set of the candidate keyword based on the occurrence position;
an alias characteristic parameter determination unit, configured to determine a geographic alias of the candidate geographic keyword, and determine an alias characteristic parameter based on the number of occurrences of all geographic aliases in the target text;
the entity characteristic parameter determining unit is used for identifying the number of entities in the target text, which have association relation with the candidate geographic keywords, and determining entity characteristic parameters;
the semantic feature parameter determining unit is used for acquiring a word cloud set of the target text and determining semantic feature parameters based on the inclusion relation between the candidate geographic keywords and the word cloud set;
a distribution characteristic parameter group determining unit, configured to identify distribution information of the target text, and determine a distribution characteristic parameter group based on a first degree of association between the distribution information and the candidate keyword;
the interactive feature parameter group determining unit is used for determining an interactive feature parameter group according to each text interactive record;
a parameter encapsulation unit, configured to generate the feature vector based on the text feature parameter group, the alias feature parameter, the entity feature parameter, the semantic feature parameter, the release feature parameter group, and the interaction feature parameter group.
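The encapsulation step described above can be sketched as follows. This is a minimal illustration only: the class name `KeywordFeatures`, its field names, the concatenation order, and the helper `alias_parameter` are assumptions introduced here, not names taken from the application.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KeywordFeatures:
    """Hypothetical container for the six parameter groups named above."""
    text_params: List[float]         # text feature parameter group (from occurrence positions)
    alias_param: float               # alias feature parameter
    entity_param: float              # entity feature parameter
    semantic_param: float            # semantic feature parameter
    release_params: List[float]      # release feature parameter group
    interaction_params: List[float]  # interactive feature parameter group

    def to_vector(self) -> List[float]:
        # Concatenate in a fixed order so that every candidate keyword
        # yields a vector with the same layout.
        return [
            *self.text_params,
            self.alias_param,
            self.entity_param,
            self.semantic_param,
            *self.release_params,
            *self.interaction_params,
        ]


def alias_parameter(text: str, aliases: List[str]) -> float:
    # Total number of occurrences of all geographic aliases in the text
    # (plain substring counting, an assumption made for this sketch).
    return float(sum(text.count(alias) for alias in aliases))
```

A fixed layout matters because the later normalization step pairs each feature value with its own reference value by position.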
Optionally, the release feature parameter group determining unit includes:
a first release characteristic value determining unit, configured to determine a publishing object of the target text, and calculate a first release characteristic value based on a first distance value between a first geographic position associated with the publishing object and a target geographic position corresponding to the candidate keyword;
a published text acquisition unit, configured to determine a text author of the target text, and acquire a plurality of published texts associated with the text author;
a second release characteristic value determining unit, configured to calculate a second release characteristic value based on a second distance value between a second geographic position corresponding to the existing geographic label of each published text and the target geographic position; wherein the second release characteristic value is specifically:
Figure BDA0003486470030000151
wherein $Publish_2$ is the second release characteristic value; $Distance(HisText_i, AddressKey)$ is the second distance value between the second geographic position of the $i$-th published text and the target geographic position; $CurrentTime$ is the release time of the target text; $Time_i$ is the release time of the $i$-th published text; $Num([HisText_i])$ is the total number of the published texts; and $Max\{Distance(HisText_i, AddressKey)\}$ is the maximum value selecting function;
and the release characteristic value packaging unit is used for determining the release characteristic parameter group according to the first release characteristic value and the second release characteristic value.
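The text above does not state how a distance value between two geographic positions is computed, nor how a distance is turned into the first release characteristic value. As a hedged sketch, the following assumes both positions are latitude/longitude pairs, uses the great-circle (haversine) distance, and maps distance to a value in (0, 1] with a hypothetical decay function. The names `haversine_km`, `first_release_value`, and the `scale_km` parameter are assumptions. The second release characteristic value is not attempted here, because its formula is given only as a figure.

```python
import math


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    # Great-circle distance between two latitude/longitude points, in kilometres.
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))


def first_release_value(distance_km: float, scale_km: float = 100.0) -> float:
    # Hypothetical mapping: 1.0 at zero distance, decaying towards 0.0
    # as the publishing object gets farther from the candidate location.
    return 1.0 / (1.0 + distance_km / scale_km)
```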
Optionally, the text interaction record comprises a text browsing record and a text comment record;
the interactive feature parameter group determining unit includes:
a second association degree determining unit, configured to determine first user information of a browsing object of each text browsing record, and determine a second association degree according to the first user information and the candidate keyword;
a third association degree determining unit, configured to determine comment contents of comment objects in the text comment records, and determine a third association degree based on the comment contents and the candidate keywords;
and the association degree packaging unit is used for generating the interactive feature parameter group according to the second association degree and the third association degree.
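One possible reading of the two association degrees is sketched below. The application does not define how the degrees are scored, so this sketch assumes the first user information is a region string per browsing user and scores each degree as a simple share; the function names are hypothetical.

```python
from typing import List


def second_association(viewer_regions: List[str], keyword: str) -> float:
    # Share of browsing users whose profile region equals the candidate keyword.
    if not viewer_regions:
        return 0.0
    return sum(region == keyword for region in viewer_regions) / len(viewer_regions)


def third_association(comments: List[str], keyword: str) -> float:
    # Share of comments whose content mentions the candidate keyword.
    if not comments:
        return 0.0
    return sum(keyword in comment for comment in comments) / len(comments)


def interaction_parameter_group(
    viewer_regions: List[str], comments: List[str], keyword: str
) -> List[float]:
    # Package the second and third association degrees together.
    return [
        second_association(viewer_regions, keyword),
        third_association(comments, keyword),
    ]
```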
Optionally, the text label probability calculating unit 93 includes:
a normalization processing unit, configured to determine a characteristic reference value corresponding to each characteristic value in the feature vector, and normalize each characteristic value according to its corresponding characteristic reference value;
a normalized vector generating unit, configured to obtain a normalized feature vector based on the normalized characteristic values;
a global feature vector determining unit, configured to import the normalized feature vector into a preset prediction module to generate a global feature vector;
and a text label probability conversion unit, configured to import the global feature vector into a preset trend evaluation module and calculate the text label probability.
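The internals of the prediction module and the trend evaluation module are not specified here. The sketch below therefore only illustrates the data flow: each characteristic value is divided by its reference value, and a weighted sum followed by a logistic function stands in for the two modules. The function names, the weights, and the choice of a logistic function are assumptions.

```python
import math
from typing import List


def normalize(features: List[float], references: List[float]) -> List[float]:
    # Divide each characteristic value by its own reference value;
    # a zero reference maps to 0.0 to avoid division by zero.
    return [f / r if r else 0.0 for f, r in zip(features, references)]


def label_probability(normalized: List[float], weights: List[float], bias: float = 0.0) -> float:
    # Stand-in for the prediction and trend evaluation modules:
    # a weighted sum squashed into (0, 1) by the logistic function.
    score = sum(w * x for w, x in zip(weights, normalized)) + bias
    return 1.0 / (1.0 + math.exp(-score))
```

Normalizing first keeps features with large raw ranges, such as occurrence counts, from dominating the weighted sum.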
Optionally, the candidate geographic keyword determination unit 91 includes:
the entity keyword identification unit is used for responding to a tag configuration request of a target text, importing the target text into an entity identification model and determining an entity keyword corresponding to the target text;
an association relationship identification unit, configured to identify entity keywords that have a co-occurrence relationship in the target text, and determine the association relationships among the entity keywords;
a knowledge graph generating unit, configured to generate a knowledge graph based on the association relationship between the entity keywords;
the fourth relevancy calculation unit is used for calculating the fourth relevancy between any two entity keywords; the fourth degree of association is:
$$Sim(E_1, E_2) = \sum_{e_i \in Context(E_1),\ e_j \in Context(E_2)} \max\, sim_{entity}(e_i, e_j);$$
$$sim_{entity}(e_i, e_j) = \sum_{p \in Prop(e_i) \cap Prop(e_j)} \omega_p \cdot Similarity_{type(p)}(e_i[p], e_j[p])$$
wherein $Sim(E_1, E_2)$ is the fourth degree of association between the two entity keywords; $Context(E_1)$ is the set of associated entities that have the association relationship with the entity keyword $E_1$ in the knowledge graph; $Context(E_2)$ is the set of associated entities that have the association relationship with the entity keyword $E_2$ in the knowledge graph; $e_i$ is the $i$-th associated entity in the association relationship of the entity keyword $E_1$; $e_j$ is the $j$-th associated entity in the association relationship of the entity keyword $E_2$; $Prop(e_i)$ is the entity type of the $i$-th associated entity in the association relationship of the entity keyword $E_1$; $Prop(e_j)$ is the entity type of the $j$-th associated entity in the association relationship of the entity keyword $E_2$; $\omega_p$ is the weight corresponding to the entity type; $Similarity_{type(p)}(e_i[p], e_j[p])$ is the matching degree function corresponding to the entity type; $e_i[p]$ is the parameter value of the entity type of the $i$-th associated entity in the association relationship of the entity keyword $E_1$; $e_j[p]$ is the parameter value of the entity type of the $j$-th associated entity in the association relationship of the entity keyword $E_2$;
the alias relationship identification unit is used for identifying any two entity keywords as entity keywords with alias relationships if the fourth degree of association is greater than a preset association threshold;
and the entity keyword clustering unit is used for clustering the entity keywords with the alias relationship into one geographic keyword.
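The fourth degree of association and the alias clustering can be sketched as follows, under several stated assumptions. Each associated entity is modelled as a dictionary from property name to parameter value, so that Prop is read as the set of property names. The formula is read as taking, for each associated entity of E1, its best match among the associated entities of E2, and summing those maxima. Exact equality stands in for the per-type matching degree function. Clustering uses a union-find structure. All function names are hypothetical.

```python
from typing import Dict, List

Entity = Dict[str, str]  # property name -> parameter value


def sim_entity(ei: Entity, ej: Entity, weights: Dict[str, float]) -> float:
    # Sum over the properties shared by both entities; exact match stands in
    # for the per-type matching degree function.
    shared = ei.keys() & ej.keys()
    return sum(weights.get(p, 1.0) * (1.0 if ei[p] == ej[p] else 0.0) for p in shared)


def sim(context1: List[Entity], context2: List[Entity], weights: Dict[str, float]) -> float:
    # For each associated entity of E1, keep its best match among the
    # associated entities of E2, then sum the best matches.
    if not context2:
        return 0.0
    return sum(max(sim_entity(ei, ej, weights) for ej in context2) for ei in context1)


def cluster_aliases(
    contexts: Dict[str, List[Entity]], weights: Dict[str, float], threshold: float
) -> List[List[str]]:
    # Union-find over entity keywords: link any pair whose association
    # degree exceeds the threshold, then collect the connected groups.
    parent = {k: k for k in contexts}

    def find(k: str) -> str:
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path halving
            k = parent[k]
        return k

    keys = list(contexts)
    for idx, a in enumerate(keys):
        for b in keys[idx + 1:]:
            if sim(contexts[a], contexts[b], weights) > threshold:
                parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for k in keys:
        groups.setdefault(find(k), []).append(k)
    return sorted(sorted(g) for g in groups.values())
```

Under this reading the score is not symmetric in its two arguments, and the sketch evaluates each pair in one direction only; a real implementation would need to decide how to handle that.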
Optionally, the geographic area tag identification unit 94 includes:
a maximum text label probability selection unit, configured to select the candidate geographic keyword with the maximum text label probability as the geographic area label of the target text;
the text label recognition device further comprises:
the text classification unit is used for classifying all the target texts based on the geographic area labels to obtain a plurality of area text groups; the geographic region labels of the target text within each of the regional text groups are the same.
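Label selection and grouping reduce to an arg-max followed by a group-by. A minimal sketch, assuming probabilities are held in a dictionary keyed by candidate keyword and texts are referred to by hypothetical string identifiers:

```python
from collections import defaultdict
from typing import Dict, List


def pick_label(probabilities: Dict[str, float]) -> str:
    # Candidate geographic keyword with the highest text label probability.
    return max(probabilities, key=probabilities.get)


def group_by_label(labels: Dict[str, str]) -> Dict[str, List[str]]:
    # Map of text id -> label becomes a map of label -> list of text ids,
    # so that every text in a group shares the same geographic area label.
    groups: Dict[str, List[str]] = defaultdict(list)
    for text_id, label in labels.items():
        groups[label].append(text_id)
    return dict(groups)
```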
Therefore, the text label identification apparatus provided by the embodiment of the present application can likewise configure geographic area labels without manual work, which greatly improves the efficiency of text label identification and reduces labor cost. In addition, when determining the feature vector of each candidate geographic keyword, the embodiment of the present application considers not only the occurrence position of the candidate geographic keyword in the target text, which indicates how important the keyword is to representing the text content, but also the interaction records of the target text, which indicate the relevance between the objects interacting with the target text and the candidate geographic keyword. This enriches the information contained in the feature vector, improves the accuracy of the subsequent identification of the geographic area label, and in turn improves text management efficiency.
It should be understood that, in the structural block diagram of the text label identification apparatus shown in fig. 9, each module is used to execute the steps in the embodiments corresponding to fig. 1 to 8. These steps have been explained in detail in the above embodiments; for details, refer to the relevant description of the embodiments corresponding to fig. 1 to 8, which is not repeated here.
Fig. 10 is a block diagram of an electronic device according to another embodiment of the present application. As shown in fig. 10, the electronic device 1000 of this embodiment includes: a processor 1010, a memory 1020, and a computer program 1030, such as a program for a text label identification method, stored in the memory 1020 and executable on the processor 1010. The processor 1010, when executing the computer program 1030, implements the steps in the embodiments of the text label identification method described above, such as S101 to S105 shown in fig. 1. Alternatively, when the processor 1010 executes the computer program 1030, the functions of the modules in the embodiment corresponding to fig. 9, for example, the functions of the units 91 to 94 shown in fig. 9, are implemented; for details, refer to the related description in the embodiment corresponding to fig. 9.
Illustratively, the computer program 1030 may be partitioned into one or more modules, which are stored in the memory 1020 and executed by the processor 1010 to accomplish the present application. The one or more modules may be a series of computer program instruction segments capable of performing specific functions, the instruction segments being used to describe the execution of the computer program 1030 in the electronic device 1000. For example, the computer program 1030 may be divided into the unit modules described above, and the specific functions of the respective modules are as described above.
The electronic device 1000 may include, but is not limited to, the processor 1010 and the memory 1020. Those skilled in the art will appreciate that fig. 10 is merely an example of the electronic device 1000 and does not constitute a limitation on it; the electronic device 1000 may include more or fewer components than illustrated, combine certain components, or use different components. For example, the electronic device may also include input-output devices, network access devices, buses, and the like.
The processor 1010 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete hardware components, or the like. A general-purpose processor may be a microprocessor or any conventional processor.
The memory 1020 may be an internal storage unit of the electronic device 1000, such as a hard disk or a memory of the electronic device 1000. The memory 1020 may also be an external storage device of the electronic device 1000, such as a plug-in hard disk, a smart card, or a flash memory card provided on the electronic device 1000. Further, the memory 1020 may include both internal and external storage units of the electronic device 1000.
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A text label identification method is characterized by comprising the following steps:
responding to a tag configuration request of a target text, and determining candidate geographic keywords contained in the target text through a preset entity identification model;
generating a feature vector corresponding to the candidate geographic keyword based on the text interaction record corresponding to the target text and the occurrence position of the candidate geographic keyword in the target text;
calculating the text label probability of the candidate geographic keywords according to the feature vectors corresponding to the candidate geographic keywords;
and determining the geographic area label corresponding to the target text from all the candidate geographic keywords based on the text label probability corresponding to each candidate geographic keyword.
2. The identification method according to claim 1, wherein the generating a feature vector corresponding to the candidate geographic keyword based on the text interaction record corresponding to the target text and the occurrence position of the candidate geographic keyword in the target text comprises:
determining a text characteristic parameter group of the candidate keywords based on the appearance position;
determining geographical aliases of the candidate geographical keywords, and determining alias characteristic parameters based on the occurrence times of all the geographical aliases in the target text;
identifying the number of entities in the target text that have an association relationship with the candidate geographic keywords, and determining entity characteristic parameters;
acquiring a word cloud set of the target text, and determining semantic feature parameters based on the inclusion relation between the candidate geographic keywords and the word cloud set;
identifying the release information of the target text, and determining a release characteristic parameter group based on a first association degree between the release information and the candidate keywords;
determining an interactive characteristic parameter group according to each text interaction record;
generating the feature vector based on the text feature parameter set, the alias feature parameter, the entity feature parameter, the semantic feature parameter, the release feature parameter set, and the interaction feature parameter set.
3. The identification method according to claim 2, wherein the identifying the release information of the target text and determining a release characteristic parameter group based on a first association degree between the release information and the candidate keywords comprises:
determining a publishing object of the target text, and calculating a first publishing characteristic value based on a first distance value between a first geographic position associated with the publishing object and a target geographic position corresponding to the candidate keyword;
determining a text author of the target text, and acquiring a plurality of published texts associated with the text author;
calculating a second release characteristic value based on a second distance value between a second geographic position corresponding to the existing geographic label of each released text and the target geographic position; wherein the second release characteristic value is specifically:
Figure FDA0003486470020000021
wherein $Publish_2$ is the second release characteristic value; $Distance(HisText_i, AddressKey)$ is the second distance value between the second geographic position of the $i$-th published text and the target geographic position; $CurrentTime$ is the release time of the target text; $Time_i$ is the release time of the $i$-th published text; $Num([HisText_i])$ is the total number of the published texts; and $Max\{Distance(HisText_i, AddressKey)\}$ is the maximum value selecting function;
and determining the release characteristic parameter group according to the first release characteristic value and the second release characteristic value.
4. The recognition method of claim 2, wherein the text interaction records comprise a text browsing record and a text comment record;
the determining the set of interactive feature parameters according to each text interaction record includes:
determining first user information of browsing objects of the text browsing records, and determining a second association degree according to the first user information and the candidate keywords;
determining comment content of a comment object of each text comment record, and determining a third degree of association based on the comment content and the candidate keywords;
and generating the set of interactive feature parameters according to the second relevance and the third relevance.
5. The method according to claim 1, wherein the calculating the text label probability of the candidate geographic keyword according to the feature vector corresponding to the candidate geographic keyword comprises:
determining a characteristic reference value corresponding to each characteristic value in the feature vector, and normalizing each characteristic value according to its corresponding characteristic reference value;
obtaining a normalized feature vector based on the normalized characteristic values;
importing the normalized feature vector into a preset prediction module to generate a global feature vector;
and importing the global feature vector into a preset trend evaluation module, and calculating to obtain the text label probability.
6. The identification method according to any one of claims 1 to 5, wherein the determining, by a preset entity identification model, the candidate geographic keywords contained in the target text in response to the tag configuration request of the target text comprises:
responding to a tag configuration request of a target text, importing the target text into an entity identification model, and determining an entity keyword corresponding to the target text;
identifying entity keywords that have a co-occurrence relationship in the target text, and determining the association relationships among the entity keywords;
generating a knowledge graph based on the association relationships among the entity keywords;
calculating a fourth degree of association between any two entity keywords; the fourth degree of association is:
$$Sim(E_1, E_2) = \sum_{e_i \in Context(E_1),\ e_j \in Context(E_2)} \max\, sim_{entity}(e_i, e_j);$$
$$sim_{entity}(e_i, e_j) = \sum_{p \in Prop(e_i) \cap Prop(e_j)} \omega_p \cdot Similarity_{type(p)}(e_i[p], e_j[p])$$
wherein $Sim(E_1, E_2)$ is the fourth degree of association between the two entity keywords; $Context(E_1)$ is the set of associated entities that have the association relationship with the entity keyword $E_1$ in the knowledge graph; $Context(E_2)$ is the set of associated entities that have the association relationship with the entity keyword $E_2$ in the knowledge graph; $e_i$ is the $i$-th associated entity in the association relationship of the entity keyword $E_1$; $e_j$ is the $j$-th associated entity in the association relationship of the entity keyword $E_2$; $Prop(e_i)$ is the entity type of the $i$-th associated entity in the association relationship of the entity keyword $E_1$; $Prop(e_j)$ is the entity type of the $j$-th associated entity in the association relationship of the entity keyword $E_2$; $\omega_p$ is the weight corresponding to the entity type; $Similarity_{type(p)}(e_i[p], e_j[p])$ is the matching degree function corresponding to the entity type; $e_i[p]$ is the parameter value of the entity type of the $i$-th associated entity in the association relationship of the entity keyword $E_1$; $e_j[p]$ is the parameter value of the entity type of the $j$-th associated entity in the association relationship of the entity keyword $E_2$;
if the fourth degree of association is greater than a preset association threshold, identifying any two entity keywords as entity keywords with an alias relationship;
clustering entity keywords with alias relations into one geographic keyword.
7. The method according to any one of claims 1 to 5, wherein the determining the geographic region tag corresponding to the target text from all the candidate geographic keywords based on the text tag probability corresponding to each of the candidate geographic keywords comprises:
selecting the candidate geographic keyword with the maximum text label probability as the geographic area label of the target text;
after the determining, based on the text tag probability corresponding to each candidate geographic keyword, a geographic area tag corresponding to the target text from all the candidate geographic keywords, the method further includes:
classifying all the target texts based on the geographic area labels to obtain a plurality of area text groups; the geographic region labels of the target text within each of the regional text groups are the same.
8. An apparatus for recognizing a text label, comprising:
the candidate geographic keyword determining unit is used for responding to a tag configuration request of a target text and determining candidate geographic keywords contained in the target text through a preset entity identification model;
the feature vector determining unit is used for generating a feature vector corresponding to the candidate geographic keyword based on the text interaction record corresponding to the target text and the occurrence position of the candidate geographic keyword in the target text;
the text label probability calculation unit is used for calculating the text label probability of the candidate geographic keywords according to the feature vectors corresponding to the candidate geographic keywords;
and the geographic area label identification unit is used for determining the geographic area label corresponding to the target text from all the candidate geographic keywords based on the text label probability corresponding to each candidate geographic keyword.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method of any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202210082518.9A 2022-01-24 2022-01-24 Text label identification method and device, electronic equipment and storage medium Pending CN114416998A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210082518.9A CN114416998A (en) 2022-01-24 2022-01-24 Text label identification method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114416998A true CN114416998A (en) 2022-04-29

Family

ID=81277793

Country Status (1)

Country Link
CN (1) CN114416998A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115170040A (en) * 2022-09-08 2022-10-11 南方电网数字电网研究院有限公司 Method and system for dynamically updating asset directory
CN115248837A (en) * 2022-09-21 2022-10-28 中科雨辰科技有限公司 Data processing system for obtaining geographic entity of text
CN115248837B (en) * 2022-09-21 2022-12-23 中科雨辰科技有限公司 Data processing system for obtaining geographic entity of text
CN115757565A (en) * 2023-01-09 2023-03-07 无锡容智技术有限公司 Text data geographic position positioning method and device

Similar Documents

Publication Publication Date Title
CN106951422B (en) Webpage training method and device, and search intention identification method and device
WO2019218514A1 (en) Method for extracting webpage target information, device, and storage medium
WO2020207074A1 (en) Information pushing method and device
CN106960030B (en) Information pushing method and device based on artificial intelligence
CN112148889A (en) Recommendation list generation method and device
CN112395506A (en) Information recommendation method and device, electronic equipment and storage medium
CN110851598B (en) Text classification method and device, terminal equipment and storage medium
US20130060769A1 (en) System and method for identifying social media interactions
CN114416998A (en) Text label identification method and device, electronic equipment and storage medium
WO2020000717A1 (en) Web page classification method and device, and computer-readable storage medium
CN108090216B (en) Label prediction method, device and storage medium
CN110390044B (en) Method and equipment for searching similar network pages
CN111460153A (en) Hot topic extraction method and device, terminal device and storage medium
CN109165975B (en) Label recommending method, device, computer equipment and storage medium
CN112650923A (en) Public opinion processing method and device for news events, storage medium and computer equipment
WO2012158572A2 (en) Exploiting query click logs for domain detection in spoken language understanding
CN109947903B (en) Idiom query method and device
CN112926308B (en) Method, device, equipment, storage medium and program product for matching text
CN109271624B (en) Target word determination method, device and storage medium
CN110968686A (en) Intention recognition method, device, equipment and computer readable medium
CN113626704A (en) Method, device and equipment for recommending information based on word2vec model
CN112579781A (en) Text classification method and device, electronic equipment and medium
CN114547257B (en) Class matching method and device, computer equipment and storage medium
CN113792131B (en) Keyword extraction method and device, electronic equipment and storage medium
CN108733702B (en) Method, device, electronic equipment and medium for extracting upper and lower relation of user query

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220922

Address after: Room 2601 (Unit 07), Qianhai Free Trade Building, No. 3048, Xinghai Avenue, Nanshan Street, Qianhai Shenzhen-Hong Kong Cooperation Zone, Shenzhen, Guangdong 518000

Applicant after: Shenzhen Ping An Smart Healthcare Technology Co.,Ltd.

Address before: 1-34 / F, Qianhai free trade building, 3048 Xinghai Avenue, Mawan, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong 518000

Applicant before: Ping An International Smart City Technology Co.,Ltd.
