CN109271640B

CN109271640B - Text information region attribute identification method and device and electronic equipment

Info

Publication number: CN109271640B
Application number: CN201811348717.XA
Authority: CN
Inventors: 邓文超; 郑茂
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Shenzhen Yayue Technology Co ltd
Priority date: 2018-11-13
Filing date: 2018-11-13
Publication date: 2021-09-17
Anticipated expiration: 2038-11-13
Also published as: CN109271640A

Abstract

The invention discloses a text information region attribute identification method and device and electronic equipment, wherein the method comprises the following steps: performing regional judgment on the text information to be recognized through the established regional judgment model; when the text information is judged to be regional, configuring different numerical values for regional words in the text information according to the sequence positions in the text information; according to the hierarchical relation of the administrative regions, fusing numerical values corresponding to regional words belonging to the same administrative region to obtain a regional word numerical value fusion result of the administrative region; and comparing the region word numerical value fusion results of all administrative regions of the same level layer by layer, determining the administrative region to which the text information belongs, and obtaining the region attribute of the text information. Therefore, even if the text information comprises a plurality of region words, the region attributes of the text information can be accurately identified by configuring numerical values for the region words and fusing the numerical values of the region words, and further personalized recommendation of the text information can be realized.

Description

Text information region attribute identification method and device and electronic equipment

Technical Field

The invention relates to the technical field of data processing, and in particular to a text information region attribute identification method and device, and to an electronic device.

Background

Intelligent recommendation is a sub-field of artificial intelligence. It refers to recommending to a user information that matches the characteristics of that user. For example, information related to the region where the user is located is pushed to the user. Identifying the regional attributes of information is therefore an urgent problem to be solved.

At present, a word segmentation technique is generally used to segment text information into a plurality of phrases. The phrases are compared with a predefined region lexicon for each administrative region, and a matching threshold is set. The region lexicon whose match exceeds the threshold is identified, and the text information is judged to describe an event occurring in the administrative region corresponding to that lexicon.

However, if the text information contains a plurality of region words, for example Beijing, Shanghai, Guangzhou and Shenzhen, simple matching against region lexicons cannot determine which administrative region the described event belongs to, so the region attribute of the text information cannot be accurately identified.

Disclosure of Invention

The invention provides a text information region attribute identification method, which aims to solve the problem that the region attribute of text information cannot be accurately identified in the related technology.

In one aspect, the present invention provides a method for identifying a region attribute of text information, including:

performing regional judgment on the text information to be recognized through the established regional judgment model;

when the text information is judged to be regional, configuring different numerical values for regional words in the text information according to the sequence positions in the text information;

according to the hierarchical relation of administrative regions, fusing numerical values corresponding to regional words belonging to the same administrative region to obtain a regional word numerical value fusion result of the administrative region;

and comparing the regional word numerical value fusion results of all administrative regions of the same level layer by layer, determining the administrative region to which the text information belongs, and obtaining the regional attribute of the text information.

In another aspect, the present invention provides a text information region attribute recognition apparatus, including:

the regional judgment module is used for performing regional judgment on the text information to be recognized through the established regional judgment model;

the numerical value configuration module is used for configuring different numerical values for the regional words in the text information according to the sequence positions in the text information when the text information is judged to be regional;

the numerical value fusion module is used for fusing numerical values corresponding to regional words belonging to the same administrative region according to the hierarchical relation of the administrative region to obtain a regional word numerical value fusion result of the administrative region;

and the region determining module is used for comparing region word value fusion results of all administrative regions of the same level layer by layer, determining the administrative region to which the text information belongs and obtaining the region attribute of the text information.

In addition, the present invention also provides an electronic device including:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to execute the region attribute identification method of the text information.

Further, the present invention also provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and the computer program can be executed by a processor to perform the region attribute identification method for text information.

The technical scheme provided by the embodiment of the invention can have the following beneficial effects:

According to the technical scheme provided by the invention, for regional text information, different numerical values are configured for the region words according to their sequence positions in the text information. According to the hierarchical relation of administrative regions, the numerical values corresponding to region words belonging to the same administrative region are fused to obtain a region word numerical value fusion result for that administrative region. By comparing, layer by layer, the fusion results of all administrative regions at the same level, the administrative region to which the text information belongs can be determined and the region attribute of the text information obtained. Therefore, even if the text information comprises a plurality of region words, the region attribute of the text information can be accurately identified by configuring numerical values for the region words and fusing those values, and personalized recommendation of the text information can then be realized.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.

FIG. 1 is a schematic illustration of an implementation environment in accordance with the present invention;

FIG. 2 is a block diagram illustrating a server in accordance with an exemplary embodiment;

FIG. 3 is a flow diagram illustrating a method for geographic attribute identification of textual information, according to an example embodiment;

FIG. 4 is a flowchart of a method for identifying a regional attribute of text information according to another embodiment based on the corresponding embodiment in FIG. 3;

FIG. 5 is a schematic diagram illustrating the training principle of the fast text model according to an embodiment;

FIG. 6 is a flowchart of a text information region attribute identification method according to yet another embodiment based on the corresponding embodiment in FIG. 3;

FIG. 7 is a detailed flowchart of step 370 in the corresponding embodiment of FIG. 3;

FIG. 8 is a detailed flowchart of a text information region attribute identification method according to an exemplary embodiment of the present invention;

FIG. 9 is a flow chart illustrating the implementation of personalized text information recommendation using the present invention;

FIG. 10 is a block diagram illustrating a text information region attribute identification apparatus according to an exemplary embodiment.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.

FIG. 1 is a schematic diagram illustrating an implementation environment to which the present invention relates, according to an exemplary embodiment. The implementation environment includes a server 110. The database of the server 110 may store the text information to be identified, so that the server 110 can perform region attribute identification on the text information using the method provided by the present invention and determine which administrative region the news described by the text information belongs to.

The implementation environment may also include a data source that provides data, i.e., textual information, as desired. In particular, in this implementation environment, the data source may be the mobile terminal 130. The server 110 may obtain the text information uploaded by the mobile terminal 130 in advance, and further perform region attribute identification on the text information by using the method provided by the present invention.

It should be noted that the method for identifying the regional attribute of the text message provided by the present invention is not limited to deploying corresponding processing logic in the server 110, and may also be deployed in other machines. For example, processing logic for performing region attribute recognition on text information is deployed in a terminal device with computing capability.

In a specific application, the server 110 may recommend text information to a user whose region matches the region attribute of that text information, thereby implementing personalized recommendation of news information.

Referring to fig. 2, fig. 2 is a schematic diagram of a server structure according to an embodiment of the present invention. The server 200 may vary significantly depending on configuration or performance, and may include one or more central processing units (CPUs) 222 (e.g., one or more processors), memory 232, and one or more storage media 230 (e.g., one or more mass storage devices) storing applications 242 or data 244. The memory 232 and the storage medium 230 may be transient or persistent storage. The program stored in the storage medium 230 may include one or more modules (not shown), each of which may include a series of instruction operations for the server 200. Further, the central processing unit 222 may be configured to communicate with the storage medium 230 and to execute, on the server 200, the series of instruction operations in the storage medium 230. The server 200 may also include one or more power supplies 226, one or more wired or wireless network interfaces 250, one or more input/output interfaces 258, and/or one or more operating systems 241, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on. The steps performed by the server described in the embodiments of fig. 3, 4 and 6-9 below may be based on the server architecture shown in fig. 2.

It will be understood by those skilled in the art that all or part of the steps for implementing the following embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

Fig. 3 is a flowchart illustrating a method for identifying a regional attribute of text information according to an exemplary embodiment. The method is applicable to, and may be executed by, a server, which may be the server 110 in the implementation environment shown in fig. 1. As shown in fig. 3, the method, which may be executed by the server 110, may include the following steps.

In step 310, the text information to be recognized is regionally judged through the constructed regional judgment model.

The text information to be recognized refers to information presented in text form whose region attribute is unknown, such as a news text. It may be stored in the local database of the server in advance, or may be obtained by the server from other terminal devices in advance. Regional judgment means judging whether the text information describes an event occurring in a particular administrative region. When the text information is judged to be regional, the subsequent steps identify the region attributes, such as province, city, district/county and business district, corresponding to the text information.

The region judgment model is used to determine whether the text information is regional. The model may be constructed in advance. Specifically, it may be obtained by machine learning on a large amount of text data whose regionality is known, training the parameters of an LR (logistic regression) model, a fastText (fast text classifier) model, or a GBDT (gradient boosting decision tree) model. The trained region judgment model is then used to perform regional judgment on the text information.

Taking as an example a region judgment model obtained by training the parameters of the LR model: when performing regional judgment, the feature data of the text information is input to the region judgment model, which outputs a judgment of whether the text information is regional. Taking as an example a region judgment model obtained by training the parameters of fastText (fast text classifier): when performing regional judgment, each word of the text information is input to the fastText model, which outputs a judgment of whether the text information is regional.

In step 330, when it is determined that the text message is regional, different numerical values are configured for the regional words in the text message according to the sequence positions in the text message.

Region words are words in the text information that denote regions, such as XX Province, XX City, XX County, XX District, XX business district, and the like. Specifically, when the text information is judged to be regional, the region words in it can be found using the region lexicon. The region words are then distinguished by their sequence positions in the text information: different numerical values (i.e., weights) are configured for region words in the title, in the media name and in the body text, and for region words in the body text the value decays with position, so that a region word appearing earlier in the text information receives a larger value and one appearing later receives a smaller value.

For example, a region word appearing in the title is assigned the value 50, a region word appearing in the media name is assigned the value 30, and region words appearing in the body text may be assigned 25, 23, 20, and so on in sequence. By traversing the region words of the whole text information, a score table stored in key-value form is obtained, where each key is a region word (at any level: province, city, district/county, business district, etc.) and each value is the score of the text information for that region word. The score table takes a form such as: Yuhong District: 8.39, Shenyang City: 7.93, Liaoning Province: 0, Wudaokou: 0.
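The value configuration described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_score_table`, the geometric decay factor `BODY_DECAY`, and the choice to sum the values of repeated occurrences are all assumptions, since the text only states that body-text values decrease with position.

```python
from collections import defaultdict

# Base values follow the figures quoted in the text.
TITLE_VALUE = 50.0
MEDIA_VALUE = 30.0
BODY_START_VALUE = 25.0
# Assumed decay factor: the text only says later region words get smaller values.
BODY_DECAY = 0.9


def build_score_table(title_words, media_words, body_words, region_lexicon):
    """Return a {region word: score} table for one piece of text.

    Each argument except region_lexicon is a list of segmented words.
    Repeated occurrences of the same region word are summed (an assumption).
    """
    table = defaultdict(float)
    for word in title_words:
        if word in region_lexicon:
            table[word] += TITLE_VALUE
    for word in media_words:
        if word in region_lexicon:
            table[word] += MEDIA_VALUE
    body_value = BODY_START_VALUE
    for word in body_words:
        if word in region_lexicon:
            table[word] += body_value
            body_value *= BODY_DECAY  # the next region word in the body gets less
    return dict(table)
```

Region words that never occur simply do not appear in the returned table, which is equivalent to a score of 0.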

In step 350, according to the hierarchical relationship of the administrative areas, the numerical values corresponding to the regional words belonging to the same administrative area are fused to obtain a regional word numerical value fusion result of the administrative area.

It should be explained that the hierarchical relationship refers to the hierarchical division of administrative regions. For example, Guangdong Province (province level), Shenzhen City (city level), Futian District (district level) and Huaqiangbei (business circle) form four levels. Likewise, other provinces, cities and districts have subordinate regions. For example, the regions subordinate to Beijing City include Dongcheng District, Xicheng District, Daxing District, Chaoyang District, Haidian District, Fengtai District, etc., and each district in turn contains a plurality of streets, towns, etc.

Here, the same administrative region means the same province, the same city, or the same district or county. Since different numerical values have been configured for the region words in the text information in step 330, the values corresponding to region words belonging to the same administrative region may be fused by adding together the values of the region words at each level within that administrative region, giving the region word numerical value fusion result of the administrative region. That is, the values of business circles, districts/counties and cities are accumulated upwards layer by layer. The values of the business circles under the same district/county are accumulated into the value of that district/county to obtain the fusion result at the district/county level. The values of all districts/counties under the same city are accumulated into the city to obtain the fusion result at the city level. The values of all cities under the same province are accumulated into the province to obtain the fusion result at the province level. The specific formula is as follows:

where λ_i, λ_j and λ_k are manually set parameters, typically in the range (0, 1). When the above three formulas are evaluated, only regions with a hierarchical relationship are combined, that is, a district/county with its business circles, a city with its districts/counties, and a province with its cities; regions without a hierarchical relationship are not combined.
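The three formulas themselves are not reproduced in this text. One form consistent with the description above and with the Shenyang City example might be written as below. The assignment of λ_i, λ_j and λ_k to the district/county, city and province levels, the symbols d, c, p and b, and the notation children(·) are assumptions made for illustration only.

```latex
\begin{aligned}
\mathrm{score}(d) &\leftarrow \mathrm{score}(d) + \lambda_i \sum_{b \,\in\, \mathrm{children}(d)} \mathrm{score}(b)
  && \text{(district/county } d \text{, business circles } b\text{)} \\
\mathrm{score}(c) &\leftarrow \mathrm{score}(c) + \lambda_j \sum_{d \,\in\, \mathrm{children}(c)} \mathrm{score}(d)
  && \text{(city } c \text{, districts/counties } d\text{)} \\
\mathrm{score}(p) &\leftarrow \mathrm{score}(p) + \lambda_k \sum_{c \,\in\, \mathrm{children}(p)} \mathrm{score}(c)
  && \text{(province } p \text{, cities } c\text{)}
\end{aligned}
```

The updates are applied in this order, so that each level already includes the contribution of the levels beneath it.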

That is, when calculating the score of "Beijing City", the scores of all administrative units under its jurisdiction, such as "Dongcheng District", "Xicheng District", "Chaoyang District", "Haidian District", etc., are added to the score of "Beijing City", but the score of "Futian District" is not. For example: score(Shenyang City) += 0.5 × (score(Yuhong District) + score(Heping District) + …). When calculating the region word numerical value fusion result of Shenyang City, only the values of administrative regions under Shenyang City are accumulated into the value of Shenyang City.
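The bottom-up accumulation can be sketched in code as follows. This is a hedged illustration under stated assumptions: the `hierarchy` mapping, the `LEVELS` list, the `lambdas` dictionary (one coefficient per child level) and the function name `fuse_scores` are hypothetical and do not come from the source.

```python
# Levels from highest to lowest; the names are illustrative.
LEVELS = ["province", "city", "district", "business_circle"]


def fuse_scores(scores, hierarchy, lambdas):
    """Accumulate child scores into their parents, lowest level first.

    scores:    {region: raw value configured from word positions}
    hierarchy: {region: (level, parent region or None)}
    lambdas:   {child level: coefficient in (0, 1) applied when the
               children of that level are added to their parent}
    """
    fused = {name: scores.get(name, 0.0) for name in hierarchy}
    # Walk upwards: business circles -> districts -> cities -> provinces.
    for level in reversed(LEVELS[1:]):
        lam = lambdas[level]
        for name, (region_level, parent) in hierarchy.items():
            if region_level == level and parent is not None:
                # Only regions with a hierarchical relationship are combined.
                fused[parent] += lam * fused[name]
    return fused
```

Because each level is processed only after the level beneath it, a city's fused score already contains the contribution of its districts when it is added to the province.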

The region word numerical value fusion result of a lowest-level administrative region is the value corresponding to the region word of that region. Assuming that the business circle is the lowest level, since no lower-level administrative region exists under a business circle, the fusion result at the business circle level is simply the value of the business circle's region word. For example, the value of the region word "Huaqiangbei" is the fusion result of the lowest-level administrative region (i.e., business circle) "Huaqiangbei", and the value of the region word "Zhongguancun" is the fusion result of the lowest-level administrative region (i.e., business circle) "Zhongguancun".

In step 370, comparing the regional word value fusion results of each administrative region of the same hierarchy layer by layer, determining the administrative region to which the text message belongs, and obtaining the regional attribute of the text message.

Here, the same level means the same level in the division of administrative regions: all provincial administrative regions belong to one level, and all city administrative regions belong to one level. For example, the 23 provinces such as Guangdong Province, Jiangsu Province, Anhui Province and Zhejiang Province are at the same level, namely the province level. Shenzhen City, Hangzhou City, Hefei City and the like are at the same level, namely the city level. Futian District, Nanshan District, Bao'an District and the like are at the same level, namely the district level. The administrative region may be a province, a city, a district, or a business district, and the administrative region with the largest region word numerical value fusion result at a given level may be taken as the administrative region to which the text information belongs. The region attribute refers to the place where the news described in the text information occurred, and may include province, city, district, business district, etc.

After the region word numerical value fusion results of the administrative regions are calculated in step 350, the fusion results of administrative regions at the same level can be compared layer by layer. First, the fusion results of the provincial administrative regions are compared, for example those of the 23 provinces such as Guangdong Province, Jiangsu Province, Anhui Province and Zhejiang Province; the province with the highest score is taken as the province-level region attribute of the text information, and the scores related to the other provinces are cleared. Within the highest-scoring province, the city with the highest score is taken as the city-level region attribute; if no city has a score, the city-level result is null. Within the highest-scoring city, the district/county with the highest score is taken as the district/county-level region attribute; if no district/county has a score, the district/county-level result is null. Within the highest-scoring district/county, the business district with the highest score is taken as the business-district-level region attribute; if no business district has a score, the business-district-level result is null. In this way, the province, city, district and even business district to which the content of the text information belongs can be determined, giving the region attribute of the text information.
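The layer-by-layer comparison can be sketched as a top-down walk that, at each level, keeps only the children of the region chosen one level above. The names `LEVELS`, `hierarchy` and `pick_region` are hypothetical, and treating a score of 0 as "no score" is an assumption of this sketch.

```python
# Levels from highest to lowest; the names are illustrative.
LEVELS = ["province", "city", "district", "business_circle"]


def pick_region(fused, hierarchy):
    """Choose the highest-scoring region at each level, top down.

    fused:     {region: fused score}
    hierarchy: {region: (level, parent region or None)}
    Returns {level: region or None}; lower levels stay None once a level
    has no scored candidate under the chosen parent.
    """
    result = {level: None for level in LEVELS}
    parent = None  # provinces have no parent
    for level in LEVELS:
        candidates = [
            name
            for name, (region_level, region_parent) in hierarchy.items()
            if region_level == level
            and region_parent == parent
            and fused.get(name, 0.0) > 0.0
        ]
        if not candidates:
            break  # nothing scored at this level: remaining results are null
        best = max(candidates, key=lambda name: fused[name])
        result[level] = best
        parent = best  # only descend into the winning region
    return result
```

Restricting each level to the children of the previous winner has the same effect as clearing the scores of the other provinces before comparing cities.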

According to the technical scheme provided by the exemplary embodiment of the invention, for regional text information, different numerical values are configured according to the sequence positions of regional words in the text information, and the numerical values corresponding to the regional words in the same administrative region are fused according to the hierarchical relationship of the administrative region, so that the regional word numerical value fusion result of the administrative region is obtained, and the administrative region to which the text information belongs can be determined by comparing the regional word numerical value fusion results of the administrative regions in the same hierarchy layer by layer, so that the regional attribute of the text information is obtained. Therefore, even if the text information comprises a plurality of region words, the region attributes of the text information can be accurately identified by configuring numerical values for the region words and fusing the numerical values of the region words, and further personalized recommendation of the text information can be realized.

In an exemplary embodiment, as shown in fig. 4, before step 310, the method further includes:

In step 301, sample information whose regionality is known is obtained, and word segmentation processing is performed on the sample information to obtain a plurality of phrases.

The sample information refers to a large amount of text information whose regionality is known; it is called sample information here to distinguish it from the text information to be recognized. Word segmentation divides the sample information into individual words, each of which may be called a phrase. Word segmentation may be implemented with existing methods, which are not described here.

In step 302, mapping a word vector corresponding to each phrase in a semantic space to obtain a plurality of word vectors corresponding to the sample information;

Mapping each phrase to its word vector in the semantic space means converting each phrase into a vector such that similar phrases have similar word vectors. The semantic space may be regarded as a multidimensional space in which each phrase is a point, and the more similar two phrases are, the closer they lie in the space. There are many ways to generate word vectors, all following one idea: the meaning of a word can be expressed by the words surrounding it. Methods of generating word vectors fall into statistics-based methods and language-model-based methods; reference may be made to the prior art for details. In this way, the series of phrases in the sample information yields a corresponding series of word vectors.
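As a small illustration of "similar phrases lie close together", closeness between two word vectors is often measured with cosine similarity. The source does not name a similarity measure, so this choice is an assumption, and the three vectors below are hand-written toy values rather than trained embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors with hypothetical values, for illustration only.
toy_vectors = {
    "city": [0.9, 0.1, 0.0],
    "town": [0.8, 0.2, 0.1],
    "metal": [0.0, 0.1, 0.9],
}
```

With these toy values, "city" and "town" score much higher than "city" and "metal", which is the behaviour the semantic space is meant to provide.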

In step 303, a text classification model is trained by the multiple word vectors of the sample information, so as to obtain the region judgment model.

The text classification model may be a fastText model, whose input is a series of word vectors and whose output is the probability of that series over the given categories. For region identification, the task is a binary classification problem: given sample information, determine whether or not it is a regional article. FIG. 5 shows the structure of the fastText model. Compared with the LR model, it omits the conventional step of manually designing feature templates and captures deeper semantic information by mapping words to word vectors. In the figure, w_i is the i-th word obtained after word segmentation of the sample information. The hidden layer is the intermediate layer of the model and the output layer produces the model's output. Because the regionality of the sample information is known, the expected result of the output layer is a known quantity, so the parameters of the intermediate layer are adjusted to make the output more accurate. The region judgment model may be the text classification model after this parameter optimization.
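The forward pass of such a model can be sketched as below, following the commonly described fastText classifier structure: word vectors are averaged into a hidden representation, which a linear output layer and softmax turn into class probabilities. This is not the fastText library API; the function names, the plain-list representation of `embeddings` and `output_weights`, and the handling of unknown words are assumptions, and training of the parameters is omitted.

```python
import math


def average_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(words, embeddings, output_weights):
    """Return class probabilities, e.g. [P(non-regional), P(regional)].

    embeddings:     {word: vector} (hidden-layer parameters)
    output_weights: one weight row per class (output-layer parameters)
    Words missing from embeddings are skipped (an assumption).
    """
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        n_classes = len(output_weights)
        return [1.0 / n_classes] * n_classes  # no evidence: uniform
    hidden = average_vectors(vectors)
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in output_weights]
    return softmax(logits)
```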

In another exemplary embodiment, as shown in fig. 6, before step 310, the method provided by the present invention may further include:

In step 601, sample information whose regionality is known is obtained, and feature data of the sample information is extracted.

Similarly, the sample information refers to a large amount of text information whose regionality is known. The feature data represents characteristics of the sample information, including text word features, region word features, text length discretization features, and the like. The text word feature is the sequence of effective words obtained by segmenting the sample information and filtering out stop words. The region word feature indicates, for each region word in an existing administrative region vocabulary (covering province, city and district/county), whether that word appears in the sample information. The text length discretization feature discretizes the length of the body text of the sample information into intervals; for example, the number of segmented words in the body text falls into 0-200, 201-400, and so on.
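The three feature groups can be sketched as a sparse feature dictionary. The function name `extract_features`, the `"word="`, `"region="` and `"len_bucket="` key prefixes, and the fixed bucket width of 200 are assumptions chosen to match the 0-200, 201-400 example; the source does not specify an encoding.

```python
def extract_features(tokens, stop_words, region_words, bucket_size=200):
    """Build a sparse {feature name: 1.0} dictionary for one segmented text.

    tokens:       segmented words of the body text
    stop_words:   set of words to filter out of the text word features
    region_words: administrative region vocabulary (province/city/district/county)
    """
    features = {}
    token_set = set(tokens)

    # Text word features: effective words left after stop-word filtering.
    for word in tokens:
        if word not in stop_words:
            features["word=" + word] = 1.0

    # Region word features: whether each vocabulary entry appears in the text.
    for region in region_words:
        if region in token_set:
            features["region=" + region] = 1.0

    # Text length discretization: 0-200 -> bucket 0, 201-400 -> bucket 1, ...
    length = len(tokens)
    bucket = 0 if length == 0 else (length - 1) // bucket_size
    features["len_bucket=%d" % bucket] = 1.0
    return features
```

A dictionary like this can be converted to a fixed-length vector by assigning each feature name an index, which is the form the LR and GBDT models consume.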

In step 602, a logistic regression model or a gradient boosting decision tree model is trained according to the feature data of the sample information, so as to obtain the region judgment model.

The LR model, i.e., the logistic regression model, is a relatively common machine learning classification method, and its calculation formula is as follows.

where x is the feature data extracted from the sample information according to the feature template, w is the weight of the model on each feature dimension, i.e. the parameter to be solved, and w^T denotes the transpose of w. y = 1 indicates regionality, and y = 0 indicates no regionality. All the constructed feature data and the corresponding label values (1 for regional, 0 for non-regional) are used as the training set to train the logistic regression model. The model parameters w and b can be estimated by maximum likelihood estimation using gradient descent, stochastic gradient descent, a quasi-Newton method, and the like (these methods belong to the prior art and are not described here). Substituting the resulting parameters w and b into the formula gives the region judgment model.
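The formula itself is not reproduced in this text. The standard logistic regression form consistent with the symbols above is P(y = 1 | x) = 1 / (1 + exp(-(w^T x + b))). The sketch below fits w and b by batch gradient descent on the negative log-likelihood. It is an illustration only: the function names, the learning rate and the epoch count are assumptions, and a production system would more likely use an established library.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(x, w, b):
    """P(y = 1 | x) = sigmoid(w^T x + b)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_lr(samples, labels, learning_rate=0.5, epochs=500):
    """Estimate w and b by batch gradient descent on the mean log-loss."""
    dim = len(samples[0])
    n = len(samples)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(samples, labels):
            error = predict(x, w, b) - y  # derivative of log-loss w.r.t. the logit
            for i in range(dim):
                grad_w[i] += error * x[i]
            grad_b += error
        w = [wi - learning_rate * g / n for wi, g in zip(w, grad_w)]
        b -= learning_rate * grad_b / n
    return w, b
```

A text is then judged regional when `predict(x, w, b)` exceeds a chosen threshold such as 0.5; the threshold is likewise an assumption.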

Similarly, the GBDT (Gradient Boosting Decision Tree) model is an iterative decision tree algorithm. The model formula is as follows:

where x is the input data, h is the classification and regression tree obtained in each iteration, w is the parameter corresponding to that regression tree, and α is the weight corresponding to each regression tree. In the GBDT-based region identification algorithm, the features used by the GBDT model are the same as those used by the LR model, namely text word features, region word features, text length discretization features, and the like. In actual online performance, the GBDT model is also broadly similar to the LR model. All the constructed feature data (as input data) and the corresponding label values are used as the training set, the weight α and the parameter w of each tree can be estimated by methods such as gradient descent, and the region judgment model is obtained by substituting the obtained parameter w and weight α into the above formula.
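A common additive form of a GBDT model that is consistent with the symbols x, h, w, and α described above is given below. The iteration index m and the number of trees M are assumptions introduced for this sketch.

```latex
F(x) = \sum_{m=1}^{M} \alpha_{m} \, h_{m}(x; w_{m})
```

Each term is one regression tree h_m with its own parameters w_m, scaled by its weight α_m; the sum over all M trees gives the model output for input x.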

The regional judgment model obtained by training a model such as LR, fastText, or GBDT mainly judges, from the body text of text information such as news, whether that text information is regional. If the model predicts a non-regional article, that result is output directly. If the model predicts a regional article, values are configured for the region words, the region word values are fused according to the hierarchical relationship of the regions, and the administrative region to which the text information belongs is determined, giving the region attribute of the text information.

In an exemplary embodiment, after the step 310, the method provided by the present invention may further include:

and when the text information is judged to be regional, performing ambiguity resolution processing on ambiguous words in the text information, and determining regional words in the text information.

It should be noted that an ambiguous word is a word having at least two meanings. Region words exhibit two kinds of ambiguity. One is ambiguity between a region word and a common word: for example, "Baiyin" (literally "silver") is the short name of Baiyin City in Gansu Province and is also the name of a metal. The other is ambiguity between region words: for example, "Chaoyang" may refer to Chaoyang District in Beijing, Chaoyang District of Changchun City in Jilin Province, or Chaoyang Town of Xinghua City in Jiangsu Province.

Ambiguity resolution of ambiguous words in the text information covers two tasks: deciding whether a word that may denote a region (such as "Baiyin") is used as a region word or as a common word, and deciding which region is meant by a region word that can denote several regions (such as "Chaoyang"). When ambiguous words such as "Chaoyang" and "Baiyin" appear in the text information, simple matching against a region word list often produces low-level errors, because it cannot detect the ambiguity between region words and common words or between region words. Disambiguation addresses this problem: it identifies the region words in the text information accurately and confirms their actual meaning.

In an exemplary embodiment, the step of "performing disambiguation processing on ambiguous words in the text information and determining regional words in the text information" specifically includes:

judging whether the ambiguous words are regional words or not through a conditional random field model according to the context of the ambiguous words;

and when the ambiguous word is judged to be the region word, determining the unique semantics of the region word according to the region information which is related to the region word and appears in the text information.

Context refers to the text preceding and following an ambiguous word. First, whether the ambiguous word is a region word is judged from its context; this step disambiguates region words from common words. Specifically, a conditional random field (CRF) model judges whether the ambiguous word is a region word or a common word. The conditional random field model is a discriminative undirected graphical model that is commonly used for sequence labeling tasks in natural language processing, such as word segmentation, part-of-speech tagging, and named entity recognition. The goal of the conditional random field model is to build a conditional probability model P(y | x). In a chain-structured conditional random field, the conditional probability is defined by the following equation:

where t_j(y_{i+1}, y_i, x, i) is the transition feature function defined on two adjacent label positions of the observation sequence, and s_k(y_i, x, i) is the state feature function defined on label position i of the observation sequence. x is the observation sequence, i.e. x_1, x_2, x_3, …, and y is the state sequence, i.e. y_1, y_2, y_3, …
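The usual textbook form of the chain-structured CRF, written with the transition and state feature functions named above, is given below as a sketch. The normalization factor Z(x) is an assumption of this reconstruction; λ_j and μ_k are the weights of the respective feature functions.

```latex
P(y \mid x) = \frac{1}{Z(x)} \exp\left(
  \sum_{j} \sum_{i} \lambda_{j} \, t_{j}(y_{i+1}, y_{i}, x, i)
  + \sum_{k} \sum_{i} \mu_{k} \, s_{k}(y_{i}, x, i)
\right)
```

Here Z(x) sums the exponential term over all possible label sequences y, so that the probabilities over label sequences add up to 1.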

Taking part-of-speech tagging tasks as examples:

x is: drama department of Chinese culture university began a new evening

y is: NS N N N V U A N

Here NS denotes a place name, N a noun, V a verb, U an auxiliary word (particle), and A an adjective.

In the CRF-based region word disambiguation task, x is the input word segmentation sequence containing ambiguous words, and y is the region word prediction sequence, which predicts whether the current word is a region word or a fragment of a region word.

λ_j and μ_k are the weights corresponding to the respective feature functions; after training is complete, they are known quantities.

Region words and common words are disambiguated by the conditional random field model. For example, if the label of "Baiyin" in the prediction sequence output by the conditional random field model is a region word, "Baiyin" in the text is mapped to "Baiyin City, Gansu Province"; otherwise, "Baiyin" is treated as a common word rather than a region word.

After the disambiguation between region words and common words is completed, the disambiguation between region words still has to be resolved. This is mainly determined by whether region information related to the ambiguous region word appears in the text information; the related region information can be the province, city, district/county, and so on corresponding to the ambiguous region word. For example, when disambiguating "Chaoyang", if "Beijing" appears in the text information, "Chaoyang" in the current text information can be determined to mean "Chaoyang District, Beijing". If "Changchun City" or "Jilin Province" appears in the text information, "Chaoyang" can be determined to mean "Chaoyang District, Changchun City, Jilin Province". In this way the unique meaning of each region word is identified accurately, and in the subsequent region word value fusion the administrative region to which each region word belongs can be determined, so that the fusion result of each administrative region is more accurate.
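The disambiguation between region words described above can be sketched as a lookup of related region words in the text. The candidate table, the function name `resolve_region`, and the decision to return nothing when zero or several candidates match are assumptions made for this illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical candidate table: ambiguous word -> (full name, related region words).
CANDIDATES: Dict[str, List[Tuple[str, List[str]]]] = {
    "Chaoyang": [
        ("Beijing / Chaoyang District", ["Beijing"]),
        ("Jilin / Changchun / Chaoyang District", ["Jilin", "Changchun"]),
        ("Jiangsu / Xinghua / Chaoyang Town", ["Jiangsu", "Xinghua"]),
    ]
}


def resolve_region(word: str, tokens: List[str]) -> Optional[str]:
    """Return the unique meaning of an ambiguous region word, or None if undecided."""
    present = set(tokens)
    # A candidate matches when at least one of its related region words is in the text.
    matches = [name for name, related in CANDIDATES.get(word, [])
               if present.intersection(related)]
    # Assumption: only an unambiguous single match is accepted.
    return matches[0] if len(matches) == 1 else None


resolved = resolve_region("Chaoyang", ["Changchun", "Chaoyang", "park", "opens"])
```

The sketch covers only the second disambiguation step; deciding whether a word is a region word at all is the job of the CRF model and is not shown.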

In an exemplary embodiment, the step 330 specifically includes:

and when the text information is judged to be regional, sequentially configuring numerical values from large to small for the regional words in the text information according to the sequence positions in the text information.

It should be explained that configuring values for the region words in the text information can be regarded as scoring each region word. The score depends on the position of the region word in the text information: the earlier a region word appears, the higher its score, and the later it appears, the lower its score. Accordingly, region words appearing in the title receive the highest scores. Further, after region word disambiguation, the region words that have been successfully disambiguated are scored. Because region information is hierarchical (province, city, district/county, business circle), each score is credited to the level to which the region word belongs. The region word scores for the text information are then fused to give the final "province-city-district/county-business circle" result.

For text information containing region words such as Touchi city, the goal is to find the corresponding region attribute correctly. In the hierarchical scoring and fusion stage, region words appearing at different positions in the text information are first distinguished, and different scoring weights are given to region words in the title, the media name, and the body text. For region words that all appear in the body text, the scoring weight additionally decays with the position of the region word in the body, which ensures that region words nearer the beginning receive higher scores.
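Position-dependent scoring with decay inside the body text can be sketched as follows. The field weights, the decay factor, the field names, and the choice to sum the scores of repeated mentions are all assumptions of this example; the source gives no concrete values.

```python
from typing import Dict, List, Tuple

# Assumed base weights per field and decay factor (not specified in the source).
FIELD_WEIGHTS: Dict[str, float] = {"title": 3.0, "media_name": 2.0, "body": 1.0}
BODY_DECAY = 0.5


def score_region_words(mentions: List[Tuple[str, str]]) -> Dict[str, float]:
    """mentions: (region word, field) pairs in reading order; returns summed scores."""
    scores: Dict[str, float] = {}
    body_rank = 0
    for word, field in mentions:
        value = FIELD_WEIGHTS[field]
        if field == "body":
            # Later body mentions receive geometrically smaller scores.
            value *= BODY_DECAY ** body_rank
            body_rank += 1
        scores[word] = scores.get(word, 0.0) + value
    return scores


scores = score_region_words(
    [("Shenzhen", "title"), ("Guangdong", "body"), ("Futian", "body")]
)
```

With these assumed weights, the title mention outranks every body mention, and the first body mention outranks the second, which matches the ordering described above.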

In an exemplary embodiment, the step 350 specifically includes:

and according to the hierarchical relation of the administrative regions, accumulating the numerical values corresponding to the regional words belonging to the same administrative region to obtain a regional word numerical value fusion result of the administrative region.

For example, suppose Beijing includes Haidian District, Fengtai District, and Daxing District. Haidian District contains business circles A, B, and C; Fengtai District contains business circles D and E; and Daxing District contains business circles F, G, and H. For convenience of description, A denotes the value configured for business circle A, B denotes the value configured for business circle B, and so on. The value configured for Haidian District is denoted by x, the value for Fengtai District by y, the value for Daxing District by z, and the value for Beijing by w. The region word value fusion result of Haidian District is then x + (A + B + C), that of Fengtai District is y + (D + E), and that of Daxing District is z + (F + G + H). The region word value fusion result of Beijing is w + x + (A + B + C) + y + (D + E) + z + (F + G + H). That is, the values corresponding to region words under the same administrative region are accumulated to obtain the region word value fusion result of that administrative region. Similarly, when the text contains several region words, the values for other provinces are accumulated in the same manner, and the province in which the content described in the text information occurred is determined by comparing the fusion results. After the province is determined, the city in which the content occurred is determined from the fusion results of the cities in that province, and so on.
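The accumulation in this example can be sketched by adding each region word's value to its own region and to every ancestor region. The parent table and the numeric values below are hypothetical and chosen only to mirror the structure x + (A + B + C).

```python
from typing import Dict, Optional

# Hypothetical hierarchy: region -> parent region (None for the top level).
PARENT: Dict[str, Optional[str]] = {
    "Beijing": None,
    "Haidian": "Beijing", "Fengtai": "Beijing",
    "A": "Haidian", "B": "Haidian", "D": "Fengtai",
}


def fuse(values: Dict[str, float]) -> Dict[str, float]:
    """Add each region word's value to its own region and to every ancestor."""
    fused: Dict[str, float] = {}
    for region, value in values.items():
        node: Optional[str] = region
        while node is not None:
            fused[node] = fused.get(node, 0.0) + value
            node = PARENT.get(node)
    return fused


fused = fuse({"Beijing": 4.0, "Haidian": 2.0, "A": 1.0, "B": 0.5, "D": 0.25})
```

Here the result for Haidian is its own value plus those of business circles A and B, and the result for Beijing is the sum of all values beneath it plus its own.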

In an exemplary embodiment, as shown in fig. 7, the step 370 specifically includes:

in step 371, comparing the regional word numerical value fusion results of each administrative region of the same level layer by layer from the high level to the mapped low level, and screening out the administrative region with the maximum fusion result under the same level layer by layer;

It should be noted that high level and low level are relative concepts: relative to the city level, the province level is the high level and the city level is the low level; relative to the district level, the city level is the high level and the district level is the low level; and so on. The low level mapped to a high level refers to the administrative regions subordinate to it: for example, if the high level is Guangdong Province, a mapped low level is Shenzhen City, and if the high level is Shenzhen City, a mapped low level is Futian District. The same level means that the administrative regions all belong to one level: all province-level administrative regions belong to the same level, and all city-level administrative regions belong to the same level. The hierarchy can be divided into four levels: province, city, district, and business circle. The administrative regions of the province level include provinces, municipalities directly under the central government, and other province-level administrative regions; all cities at the next level are regarded as administrative regions of the same level, and likewise all districts at the level below that.

For example, Beijing, Guangdong Province, and Zhejiang Province belong to the same level, and the administrative region with the largest fusion result is found by comparing the region word value fusion results of Beijing, Guangdong Province, and Zhejiang Province. Comparing the region word value fusion results of the administrative regions of the same level layer by layer, and screening out the administrative region with the largest fusion result at each level, means the following: the province-level administrative regions are compared first and the province with the largest fusion result is screened out; then the city-level administrative regions under that province are compared and the city with the largest fusion result is screened out; then the district-level administrative regions under that city are compared and the district with the largest fusion result is screened out; and so on.

If at least two administrative regions share the largest fusion result at any level, no administrative region is screened from that level. For example, after the province with the largest region word value fusion result is determined, if at least two cities in that province share the largest region word value fusion result, no city is selected, and only the province determined at the previous level is taken as the region attribute of the text information.

In step 372, according to the administrative region with the maximum fusion result at each level, the administrative region to which the text message belongs is determined, and the regional attribute of the text message is obtained.

For example, suppose the province-level administrative region with the largest fusion result is Zhejiang Province. The city with the largest fusion result (say, Hangzhou) is found from the region word value fusion results of the cities in Zhejiang Province; the district with the largest fusion result (say, Binjiang District) is found from the fusion results of the districts in Hangzhou; and the business circle with the largest fusion result is found from the fusion results of the business circles in Binjiang District. Once the administrative region with the largest fusion result has been found at each level, the province, city, district, and business circle to which the text information belongs are determined. The content recorded in the text information is considered to concern an event occurring in that province, city, district, and business circle, and these are taken as the region attribute of the text information.

Further, step 370 also includes: if none of the administrative regions at the current level has a region word value fusion result, no administrative region is screened from the current level.

For example, if none of the city-level administrative regions of a province has a corresponding region word value fusion result, no city-level administrative region of that province is screened, that is, the city-level region attribute may be empty. Similarly, if none of the districts of a city has a corresponding region word value fusion result, no administrative region is screened at the district level, and the district-level region attribute is empty.
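The layer-by-layer screening, including the two stopping conditions (a tie for the maximum, or no fusion result at the current level), can be sketched as a walk down the hierarchy. The child table, the artificial "ROOT" node, and the function name `select_path` are assumptions of this illustration.

```python
from typing import Dict, List

# Hypothetical hierarchy: region -> child regions at the next lower level.
CHILDREN: Dict[str, List[str]] = {
    "ROOT": ["Zhejiang", "Guangdong"],
    "Zhejiang": ["Hangzhou", "Ningbo"],
    "Hangzhou": ["Binjiang", "Xihu"],
}


def select_path(fused: Dict[str, float]) -> List[str]:
    """Walk down from the top level, keeping the unique maximum at each level."""
    path: List[str] = []
    node = "ROOT"
    while True:
        scored = [c for c in CHILDREN.get(node, []) if c in fused]
        if not scored:
            break  # no fusion result at this level: the attribute stays empty
        best = max(fused[c] for c in scored)
        winners = [c for c in scored if fused[c] == best]
        if len(winners) > 1:
            break  # tie for the maximum: stop at the previous level
        node = winners[0]
        path.append(node)
    return path


path = select_path({"Zhejiang": 5.0, "Guangdong": 1.0, "Hangzhou": 3.0,
                    "Binjiang": 1.0, "Xihu": 1.0})
```

In this example the two districts tie, so the walk stops at the city level and the district-level attribute is left empty.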

On the basis of the above exemplary embodiment, the technical solution provided by the present invention may further include:

the occurrence frequency of each region word and the occurrence frequency of each regional representation word in the text information are used as input, and the probability that the text information belongs to each administrative region of each level is output through a pre-constructed fully connected model;

the fully connected model refers to a fully connected neural network model, which can be obtained by training on a large amount of sample data. The regional representation words include "our province", "our city", "this city", and the like. Each region word in the region word list has a corresponding occurrence frequency in the text information, and each regional representation word likewise has a corresponding occurrence frequency.

Specifically, the occurrence frequency of each region word in the region word list and the occurrence frequency of each regional representation word can be used as the input of the fully connected model, which outputs the probability that the text information belongs to each province, each city, each district/county, and each business circle.

From the high level to the low level of the mapping, carrying out weighted addition on the region word numerical value fusion result of each level of administrative region and the corresponding probability of the administrative region, and screening out the high level with the maximum weighted addition result and the low level of the mapping;

for example, a high level together with its mapped low levels can be Guangdong Province, Shenzhen City, Futian District, and the AA business circle. Weighted addition of the region word value fusion result of each level's administrative region and the corresponding probability of that administrative region then means the weighted addition of the fusion result and the probability of Guangdong Province, the fusion result and the probability of Shenzhen City, the fusion result and the probability of Futian District, and the fusion result and the probability of the AA business circle.

Therefore, all province-city-district/county-business circle combinations have corresponding weighted addition results, and the province-city-district/county-business circle combination with the largest weighted addition result, namely a high-level and a mapped low-level are obtained.

And checking the region attribute of the text information through the screened high layer and the mapped low layer.

The province-city-district/county-business circle combination with the largest weighted addition result can be regarded as the province, city, district/county, and business circle to which the text information belongs, and can be used to check the region attribute of the text information obtained in step 370. If this combination is inconsistent with the region attribute obtained in step 370, the combination with the largest weighted addition result can be used as the region attribute of the text information. Alternatively, as required, the identification accuracy of the two region attribute identification modes can be measured experimentally, and the administrative region determined by the mode with the higher accuracy is used as the region attribute of the text information.
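The weighted addition and the selection of the best combination can be sketched as follows. The mixing weights, the summation of per-level scores along a combination, and the function name `best_chain` are assumptions of this example; the probabilities stand in for the output of the fully connected model, which is not implemented here.

```python
from typing import Dict, List, Tuple

# Assumed mixing weights; the source does not specify them.
W_FUSION, W_PROB = 0.5, 0.5


def best_chain(chains: List[Tuple[str, ...]],
               fused: Dict[str, float],
               prob: Dict[str, float]) -> Tuple[str, ...]:
    """Score each province-city-district combination and return the best one."""
    def chain_score(chain: Tuple[str, ...]) -> float:
        # Weighted addition of fusion result and model probability per level.
        return sum(W_FUSION * fused.get(r, 0.0) + W_PROB * prob.get(r, 0.0)
                   for r in chain)
    return max(chains, key=chain_score)


chains = [("Guangdong", "Shenzhen", "Futian"), ("Zhejiang", "Hangzhou", "Binjiang")]
fused = {"Guangdong": 4.0, "Shenzhen": 3.0, "Futian": 1.0, "Zhejiang": 2.0}
prob = {"Guangdong": 0.7, "Shenzhen": 0.6, "Futian": 0.5,
        "Zhejiang": 0.3, "Hangzhou": 0.2, "Binjiang": 0.1}
winner = best_chain(chains, fused, prob)
```

Because fusion results and probabilities live on different scales, a real system would probably normalize one of them or tune the two weights; the equal weights here are only a placeholder.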

Fig. 8 is a detailed flowchart of a text information region attribute identification method according to an exemplary embodiment of the present invention. As shown in fig. 8, the method may specifically include the following steps:

in step 801, the server receives a request for geographical attribute identification, which carries news text.

In step 802, the server performs word segmentation processing on the news text to obtain a series of phrases.

In step 803, the server determines whether the news text is regional by using the established regional determination model through the series of phrases.

In step 804, when the news text is regional, the regional word in the news text is disambiguated.

In step 805, the disambiguated regional words are scored according to the positions in the text, and the numerical values corresponding to the regional words in the same administrative area are fused according to the hierarchical relationship setting, so as to obtain the regional word numerical value fusion result of the administrative area.

In step 806, the regional word value fusion results of each administrative area of the same hierarchy are compared layer by layer, the administrative area to which the news text belongs is determined, and the regional attribute of the news text is obtained.

Conversely, when step 803 determines that the news text is not regional, the result that the news text is not regional is obtained directly.

In step 807, a region attribute identification result of the news text is returned to the requester.

Fig. 9 is a flow chart illustrating the implementation of personalized text information recommendation by using the present invention. As shown in fig. 9, in step 901, a news text is input, and the server runs a region identification system, which identifies the region attributes of the news text, including province, city, district, business district, etc., by using the method provided by the present invention.

Meanwhile, in step 902, LBS (Location Based Service) attributes of the user, that is, Location information of the user, are acquired.

Further, in step 903, the server may operate the recommendation system to match the region attribute of the news text with the location information of the user, so as to obtain news recommended to the user in the region slot, that is, a news recommendation result.
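The matching in step 903 can be sketched as a comparison between the region levels identified for each news text and the user's location. The record layout, the rule that every identified level must match, and the function name `recommend` are assumptions of this illustration rather than the recommendation system's actual logic.

```python
from typing import Dict, List


def recommend(news: List[Dict[str, str]], user_location: Dict[str, str]) -> List[str]:
    """Return ids of news whose identified region levels all match the user's location."""
    result = []
    for item in news:
        # Keep only the region levels that were actually identified (non-empty).
        levels = {k: v for k, v in item.items() if k != "id" and v}
        if levels and all(user_location.get(k) == v for k, v in levels.items()):
            result.append(item["id"])
    return result


news = [
    {"id": "n1", "province": "Guangdong", "city": "Shenzhen"},
    {"id": "n2", "province": "Zhejiang", "city": "Hangzhou"},
    {"id": "n3", "province": "Guangdong", "city": ""},
]
user = {"province": "Guangdong", "city": "Shenzhen", "district": "Futian"}
recommended = recommend(news, user)
```

News with an empty city-level attribute is still matched at the province level, which mirrors the case where lower-level region attributes were left empty during identification.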

The following is an embodiment of the apparatus of the present invention, which can be used to execute an embodiment of the method for identifying the region attribute of the text message executed by the server 110 according to the present invention. For details that are not disclosed in the embodiments of the apparatus of the present invention, please refer to the embodiments of the method for identifying a region attribute of text information of the present invention.

Fig. 10 is a block diagram illustrating a text information region attribute recognition apparatus according to an exemplary embodiment, which may be used in the server 110 in the implementation environment shown in fig. 1 to execute all or part of the steps of the text information region attribute recognition method shown in any one of fig. 3, 4, and 6-9. As shown in fig. 10, the apparatus includes, but is not limited to: a regional judgment module 1010, a value configuration module 1030, a value fusion module 1050, and a region determination module 1070.

The regional judgment module 1010 is used for performing regional judgment on the text information to be recognized through the established regional judgment model;

a value configuration module 1030, configured to configure different values for the region words in the text information according to the sequence positions in the text information when it is determined that the text information has regionality;

the numerical value fusion module 1050 is configured to fuse numerical values corresponding to regional words belonging to the same administrative area according to a hierarchical relationship of the administrative area, so as to obtain a regional word numerical value fusion result of the administrative area;

the region determining module 1070 is configured to compare region word value fusion results of each administrative region of the same hierarchy layer by layer, determine an administrative region to which the text information belongs, and obtain a region attribute of the text information.

The implementation processes of the functions and actions of the modules in the device are specifically described in the implementation processes of the corresponding steps in the text information region attribute identification method, and are not described herein again.

The regional judgment module 1010 can be, for example, a physical structure such as the central processor 222 in fig. 2.

The value configuration module 1030, the value fusion module 1050, and the region determination module 1070 may also be functional modules, and are configured to execute corresponding steps in the text information region attribute identification method. It is understood that these modules may be implemented in hardware, software, or a combination of both. When implemented in hardware, these modules may be implemented as one or more hardware modules, such as one or more application specific integrated circuits. When implemented in software, the modules may be implemented as one or more computer programs executing on one or more processors, such as programs stored in memory 232 for execution by central processor 222 of FIG. 2.

In an exemplary embodiment, the apparatus further comprises:

the word segmentation module is used for acquiring sample information whose regionality is known and performing word segmentation processing on the sample information to obtain a plurality of word groups;

the word vector construction module is used for mapping a word vector corresponding to each word group in a semantic space to obtain a plurality of word vectors corresponding to the sample information;

and the model construction module is used for training a text classification model through a plurality of word vectors of the sample information to obtain the region judgment model.

In an exemplary embodiment, the apparatus further comprises:

the data acquisition module is used for acquiring sample information whose regionality is known and extracting the feature data of the sample information;

and the model training module is used for training a logistic regression model or a gradient boosting decision tree model through the feature data of the sample information to obtain the region judgment model.

In an exemplary embodiment, the apparatus further comprises:

and the region disambiguation module is used for performing ambiguity resolution processing on ambiguous words in the text information and determining region words in the text information when the text information is judged to have regionality.

In an exemplary embodiment, the region disambiguation module comprises:

the region word judging unit is used for judging whether the ambiguous word is the region word or not through a conditional random field model according to the context of the ambiguous word;

and the region word semantic determining unit is used for determining the unique semantic of the region word according to the region information which is related to the region word and appears in the text information when the ambiguous word is judged to be the region word.

In an exemplary embodiment, the value configuration module 1030 includes:

In an exemplary embodiment, the numerical fusion module 1050 includes:

and the numerical value accumulation unit is used for accumulating numerical values corresponding to regional words under the same administrative region according to the hierarchical relationship setting of the administrative region to obtain a regional word numerical value fusion result of the administrative region.

In an exemplary embodiment, the region determination module 1070 includes:

the layer-by-layer screening unit is used for comparing the regional word numerical value fusion results of each administrative region of the same level layer by layer from a high layer to a mapped low layer, and screening the administrative region with the maximum fusion result under the same level layer by layer;

and the attribute obtaining unit is used for determining the administrative region to which the text information belongs according to the administrative region with the maximum fusion result under each level, and obtaining the regional attribute of the text information.

In an exemplary embodiment, the apparatus further comprises:

the probability calculation module is used for taking the occurrence frequency of each region word and the occurrence frequency of each regional representation word in the text information as input and outputting the probability that the text information belongs to each administrative region of each level through a pre-constructed fully connected model;

the weighted fusion module is used for carrying out weighted addition on the regional word numerical value fusion result of each level of administrative region and the corresponding probability of the administrative region from the high level to the low level of the mapping, and screening out the high level with the maximum weighted addition result and the low level of the mapping;

and the attribute checking module is used for checking the region attribute of the text information through the screened high layer and the mapped low layer.

Optionally, the present invention further provides an electronic device, which may be used in the server 110 in the implementation environment shown in fig. 1 to execute all or part of the steps of the text information region attribute identification method shown in any one of fig. 3, fig. 4, and fig. 6 to fig. 9. The electronic device includes:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to execute the region attribute identification method for text information according to the above exemplary embodiment.

The specific manner in which the processor of the apparatus in this embodiment performs the operation has been described in detail in the embodiment of the region attribute identification method with respect to the text information, and will not be described in detail here.

In an exemplary embodiment, a storage medium is also provided. The storage medium is a computer-readable storage medium containing instructions, and may be, for example, a transitory or non-transitory computer-readable storage medium. The storage medium stores a computer program executable by a processor to perform the above-described method for identifying a region attribute of text information.

It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims

1. A region attribute identification method of text information is characterized by comprising the following steps:

comparing the regional word numerical value fusion results of all administrative regions of the same level layer by layer, determining the administrative region to which the text information belongs, and obtaining the regional attribute of the text information;

2. The method according to claim 1, wherein the regional word numerical value fusion result of the lowest-level administrative area is a numerical value corresponding to the regional word of the lowest-level administrative area.

3. The method according to claim 1, wherein before performing regional judgment on the text information to be recognized through the constructed regional judgment model, the method further comprises:

acquiring sample information for which it is known whether the information is regional, and performing word segmentation on the sample information to obtain a plurality of phrases;

mapping each phrase to its corresponding word vector in a semantic space to obtain a plurality of word vectors corresponding to the sample information;

and training a text classification model with the plurality of word vectors of the sample information to obtain the regional judgment model.
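By way of illustration only, the steps of claim 3 can be sketched as follows. The whitespace tokenizer, the two-dimensional vector table `WORD_VECTORS`, and the nearest-centroid classifier are all hypothetical stand-ins chosen to keep the sketch self-contained; the claim does not specify a segmentation tool, an embedding source, or a particular text classification model.

```python
# Hypothetical 2-dimensional word vectors; a real system would load trained embeddings.
WORD_VECTORS = {
    "shenzhen": [1.0, 0.0],
    "metro": [0.8, 0.2],
    "opens": [0.5, 0.5],
    "new": [0.4, 0.6],
    "phone": [0.0, 1.0],
    "released": [0.1, 0.9],
}


def segment(text):
    # Assumption: whitespace splitting stands in for real word segmentation.
    return text.lower().split()


def to_vectors(text):
    # Map each known phrase to its vector in the (toy) semantic space.
    return [WORD_VECTORS[w] for w in segment(text) if w in WORD_VECTORS]


def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def train_centroids(samples):
    # Assumption: a nearest-centroid model stands in for the text classifier.
    grouped = {}
    for text, label in samples:
        grouped.setdefault(label, []).append(mean_vector(to_vectors(text)))
    return {label: mean_vector(vs) for label, vs in grouped.items()}


def classify(centroids, text):
    doc = mean_vector(to_vectors(text))

    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(doc, center))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Labelled samples: True means the sample is known to be regional.
SAMPLES = [("Shenzhen metro opens", True), ("new phone released", False)]
model = train_centroids(SAMPLES)
```

With this toy data, `classify(model, "new Shenzhen metro")` falls on the regional side and `classify(model, "phone released")` on the non-regional side.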

4. The method according to claim 1, wherein before performing regional judgment on the text information to be recognized through the constructed regional judgment model, the method further comprises:

acquiring sample information for which it is known whether the information is regional, and extracting feature data of the sample information;

and training a logistic regression model or a gradient boosting decision tree model with the feature data of the sample information to obtain the regional judgment model.
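As a minimal sketch of the logistic regression alternative in claim 4, the following trains a model by batch gradient descent in plain Python. The single feature used here (the number of place names found in a sample), the learning rate, and the epoch count are assumptions for illustration; the claim does not say which feature data are extracted or how the model is optimized.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(features, labels, lr=0.5, epochs=2000):
    """Batch gradient descent on the logistic loss; returns (weights, bias)."""
    n, d = len(features), len(features[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = sigmoid(score) - y
            for i in range(d):
                grad_w[i] += error * x[i]
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def is_regional(weights, bias, x):
    # Judge a sample as regional when the predicted probability is at least 0.5.
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(score) >= 0.5


# Hypothetical feature data: number of place names found in each sample text.
FEATURES = [[0.0], [0.0], [1.0], [2.0], [3.0], [2.0]]
LABELS = [0, 0, 1, 1, 1, 1]  # 1 = regional, 0 = not regional
weights, bias = train_logistic_regression(FEATURES, LABELS)
```

A gradient boosting decision tree model could be trained on the same feature data in place of this function; it is omitted here because it would require a third-party library.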

5. The method according to claim 1, wherein after performing regional judgment on the text information to be recognized through the established regional judgment model, the method further comprises:

6. The method according to claim 5, wherein the disambiguating the ambiguous word in the text information to determine the regional word in the text information when it is determined that the text information is regional comprises:

7. The method according to claim 1, wherein the configuring, when the text information is judged to be regional, different numerical values for regional words in the text information according to their sequence positions in the text information comprises:

8. The method according to claim 1, wherein the fusing values corresponding to regional words belonging to the same administrative area according to the hierarchical relationship of the administrative area to obtain a regional word value fusion result of the administrative area comprises:

9. The method according to claim 1, wherein the step of comparing the regional word numerical value fusion results of the administrative regions of the same level layer by layer to determine the administrative region to which the text information belongs and obtain the regional attribute of the text information comprises:

comparing the regional word numerical value fusion results of the administrative regions of the same level layer by layer, from a high layer to the mapped low layer, and screening out, layer by layer, the administrative region with the maximum fusion result at each level;

and determining the administrative region to which the text information belongs according to the administrative region with the maximum fusion result under each level, and obtaining the regional attribute of the text information.

10. The method according to claim 9, wherein the comparing, layer by layer, the regional word numerical value fusion results of the administrative regions of the same level to determine the administrative region to which the text information belongs and obtain the regional attribute of the text information further comprises:

and if at least two administrative regions with the maximum fusion result exist at any level, administrative regions are not screened from that level.

11. The method according to claim 9, wherein the comparing, layer by layer, the regional word numerical value fusion results of the administrative regions of the same level to determine the administrative region to which the text information belongs and obtain the regional attribute of the text information further comprises:

and if no regional word numerical value fusion result exists at any lower layer for the administrative regions of the current level, administrative regions are not screened from the current level.
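For illustration, the layer-by-layer screening of claims 9 to 11 can be sketched as below. Several points are assumptions rather than statements of the claimed method: the hierarchy table `PARENT` is hypothetical, values are assumed to decrease with sequence position, fusion is assumed to be a sum of a region's own value and the values of its subordinate regions, and the conditions of claims 10 and 11 are read as points at which screening stops.

```python
from collections import defaultdict

# Hypothetical hierarchy: region -> parent region (None for the top level).
PARENT = {
    "Guangdong": None,
    "Hunan": None,
    "Shenzhen": "Guangdong",
    "Guangzhou": "Guangdong",
    "Changsha": "Hunan",
    "Nanshan": "Shenzhen",
}


def assign_values(region_words):
    # Assumed rule: earlier regional words receive larger values.
    values = defaultdict(float)
    total = len(region_words)
    for index, word in enumerate(region_words):
        values[word] += float(total - index)
    return dict(values)


def fuse(word_values):
    # Assumed fusion: each value is added to the region itself and to all
    # of its ancestors, so a lowest-level region keeps only its own value.
    fused = defaultdict(float)
    for region, value in word_values.items():
        node = region
        while node is not None:
            fused[node] += value
            node = PARENT[node]
    return dict(fused)


def select_region(fused):
    """Walk from the top level downward, keeping the unique maximum per level."""
    path, parent = [], None
    while True:
        candidates = {r: v for r, v in fused.items() if PARENT[r] == parent}
        if not candidates:
            break  # no fusion result at the lower level: stop (claim 11 reading)
        best = max(candidates.values())
        winners = [r for r, v in candidates.items() if v == best]
        if len(winners) > 1:
            break  # tie at this level: stop (claim 10 reading)
        path.append(winners[0])
        parent = winners[0]
    return path


values = assign_values(["Shenzhen", "Changsha", "Nanshan"])
result = select_region(fuse(values))
```

Under these assumptions, a text mentioning Shenzhen, Changsha, and Nanshan in that order resolves to the path Guangdong, Shenzhen, Nanshan, while a text mentioning only Shenzhen and Guangzhou with equal values resolves no further than Guangdong.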

12. An apparatus for recognizing a regional attribute of text information, the apparatus comprising:

the region determining module is used for comparing the regional word numerical value fusion results of the administrative regions of the same level layer by layer, determining the administrative region to which the text information belongs, and obtaining the regional attribute of the text information;

13. The apparatus of claim 12, wherein the region determining module comprises:

14. An electronic device, characterized in that the electronic device comprises:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to perform the region attribute identification method of text information according to any one of claims 1-11.