CN115169319B

CN115169319B - Data processing system of identification symbol

Info

Publication number: CN115169319B
Application number: CN202210856545.7A
Authority: CN
Inventors: 刘羽; 张正义; 刘宸; 傅晓航
Original assignee: Zhongke Yuchen Technology Co Ltd
Current assignee: Zhongke Yuchen Technology Co Ltd
Priority date: 2022-07-21
Filing date: 2022-07-21
Publication date: 2023-02-07
Anticipated expiration: 2042-07-21
Also published as: CN115169319A

Abstract

The invention relates to a data processing system for recognizing symbols, comprising a database, a processor, and a memory storing a computer program which, when executed by the processor, performs the following steps: acquiring a first text list and a second text list corresponding to any event; acquiring, from each first text, a target triple corresponding to the first text list; obtaining, from any second text, a key triple corresponding to that second text; when any key component of the key triple is null, obtaining the similarity corresponding to the second text; when that similarity is not less than a preset similarity threshold, determining that the key component is the corresponding component of the target triple; and when that similarity is less than the preset similarity threshold, marking the key component as an abnormal symbol. In this way the meaning represented by a symbol in a text can be determined, and the events that have occurred can be learned accurately from the content of network texts.

Description

Data processing system of identification symbol

Technical Field

The invention relates to the technical field of entity recognition, and in particular to a data processing system for recognizing symbols.

Background

With the advent of the network age, internet users are increasingly active in acquiring network content and in taking part in creating it, and social media is one important form of such content. Social media, as its name suggests, is used for social interaction; as the number of users grows, one or more overlapping social networks gradually form within it, along which social information propagates between users. Generally, a social media user can directly obtain the information posted by the users they follow; from a graph point of view, information is obtained from adjacent users. Although the structure of a social network is quite complex, according to the six degrees of separation theory its diameter is not large, so that through forwarding by users, information can break through regional limits and spread rapidly over the social network, and by obtaining social media information people can learn of events occurring in real life at the fastest speed. However, in social media text a symbol may be substituted for a word or phrase, so that the event that occurred cannot be learned from the content.

Disclosure of Invention

In view of the above technical problems, the technical solution adopted by the present invention is a data processing system for recognizing symbols, the system comprising: a database, a processor, and a memory storing a computer program, wherein the database comprises a target text set of events A = {A_1, …, A_i, …, A_n}, A_i is the target text list corresponding to the i-th event, i = 1, …, n, and n is the number of events, and when the computer program is executed by the processor, the following steps are implemented:

S100, obtaining a first text list C_i = {C_i1, …, C_ix, …, C_ip} corresponding to A_i, where C_ix is the x-th first text of the i-th target event, x = 1, …, p, and p is the number of first texts of the i-th target event, and obtaining a second text list D_i = {D_i1, …, D_iy, …, D_iq} corresponding to A_i, where D_iy is the y-th second text of the i-th target event, y = 1, …, q, and q is the number of second texts of the i-th target event;

S200, obtaining, according to each C_ix, a target triplet C'_i = {C'_i1, C'_i2, C'_i3} corresponding to C_i, where C'_i1 is the first target entity of C_i, C'_i2 is the second target entity of C_i, and C'_i3 is the target relationship between C'_i1 and C'_i2;

S300, obtaining, according to D_iy, a key triplet H_iy = {H^1_iy, H^2_iy, H^3_iy} corresponding to D_iy, where H^1_iy is the first key entity of D_iy, H^2_iy is the second key entity of D_iy, and H^3_iy is the key relationship between H^1_iy and H^2_iy;

S400, when H^g_iy is null, obtaining a similarity F_iy corresponding to D_iy, where H^g_iy is any one of H^1_iy, H^2_iy and H^3_iy;

S500, when F_iy is not less than a preset similarity threshold, determining that H^g_iy = C'_ig, where C'_ig is any one of C'_i1, C'_i2 and C'_i3;

S600, when F_iy is less than the preset similarity threshold, marking H^g_iy as an abnormal symbol.

Compared with the prior art, the invention has obvious advantages and beneficial effects. By the technical scheme, the data processing system for the identification symbol provided by the invention can achieve considerable technical progress and practicability, has wide industrial utilization value and at least has the following advantages:

The data processing system for recognizing symbols of the present invention comprises a database, a processor, and a memory storing a computer program, wherein the database comprises a target text set of events, and the computer program, when executed by the processor, implements the following steps: acquiring a first text list and a second text list corresponding to any event; acquiring, from each first text, a target triple corresponding to the first text list; obtaining, from any second text, a key triple corresponding to that second text; when any key component of the key triple is null, obtaining the similarity corresponding to the second text; when that similarity is not less than a preset similarity threshold, determining that the key component is the corresponding component of the target triple; and when that similarity is less than the preset similarity threshold, marking the key component as an abnormal symbol. In this way the meaning represented by a symbol in a text can be determined, and the events that have occurred can be learned accurately from the content of network texts.

The foregoing description is only an overview of the technical solutions of the present invention, and in order to make the technical means of the present invention more clearly understood, the present invention may be implemented in accordance with the content of the description, and in order to make the above and other objects, features, and advantages of the present invention more clearly understood, the following preferred embodiments are described in detail with reference to the accompanying drawings.

Drawings

Fig. 1 is a flowchart illustrating steps performed by a data processing system for recognizing a symbol according to an embodiment of the present invention.

Detailed Description

To further illustrate the technical means adopted by the present invention to achieve its intended objects, and their effects, a data processing system for recognizing symbols and its effects are described in detail below with reference to the accompanying drawings and preferred embodiments.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Example one

This embodiment provides a data processing system for recognizing symbols, the system comprising: a database, a processor, and a memory storing a computer program, wherein the database comprises a target text set of events A = {A_1, …, A_i, …, A_n}, A_i is the target text list corresponding to the i-th event, i = 1, …, n, and n is the number of events. When the computer program is executed by the processor, the following steps are implemented, as shown in Fig. 1:

S100, obtaining a first text list C_i = {C_i1, …, C_ix, …, C_ip} corresponding to A_i, where C_ix is the x-th first text of the i-th target event, x = 1, …, p, and p is the number of first texts of the i-th target event, and obtaining a second text list D_i = {D_i1, …, D_iy, …, D_iq} corresponding to A_i, where D_iy is the y-th second text of the i-th target event, y = 1, …, q, and q is the number of second texts of the i-th target event.

Specifically, in step S100 the first texts and the second texts are obtained through the following steps:

S101, obtaining A_i = {A_i1, …, A_ij, …, A_im_i} from the database, where A_ij is the j-th target text corresponding to the i-th event, j = 1, …, m_i, and m_i is the number of target texts corresponding to the i-th event.

S103, performing word segmentation on A_ij to obtain a target word string B_ij = {B^1_ij, …, B^r_ij, …, B^Sj_ij} corresponding to A_ij, where B^r_ij is the r-th target word corresponding to A_ij, r = 1, …, Sj, and Sj is the number of target words corresponding to A_ij; any word segmentation method in the art falls within the protection scope of the present invention.

S105, when B^r_ij is not a symbol, determining A_ij to be a first text; that is, a first text contains only words and/or phrases.

S107, when B^r_ij is a symbol, determining A_ij to be a second text; that is, a second text contains symbols whose meaning cannot be known directly.

Specifically, the target text refers to text representing a target event, and the target event refers to an event focused by a user.

Specifically, the target text is text from unofficial media; preferably, the target text is Twitter text.
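As a minimal sketch of steps S101 to S107, the following Python splits the target texts of one event into first texts (no symbol tokens) and second texts (at least one symbol token). The whitespace tokenizer and the rule that a token counts as a symbol when it contains no letter or digit are assumptions made for illustration; the description leaves both the word segmentation method and the symbol test open.

```python
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    # Assumption: whitespace splitting stands in for any word segmentation method.
    return text.split()


def is_symbol(token: str) -> bool:
    # Assumption: a token with no letter or digit is treated as a symbol.
    return not any(ch.isalnum() for ch in token)


def split_texts(target_texts: List[str]) -> Tuple[List[str], List[str]]:
    """Return (first_texts, second_texts) for one event's target text list A_i."""
    first_texts: List[str] = []
    second_texts: List[str] = []
    for text in target_texts:
        tokens = tokenize(text)
        if any(is_symbol(tok) for tok in tokens):
            second_texts.append(text)  # S107: the text contains a symbol
        else:
            first_texts.append(text)   # S105: only words and/or phrases
    return first_texts, second_texts
```

With this rule a text such as "flood hits ##" lands in the second text list, while "flood hits city" lands in the first.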

S200, obtaining, according to each C_ix, a target triplet C'_i = {C'_i1, C'_i2, C'_i3} corresponding to C_i, where C'_i1 is the first target entity of C_i, C'_i2 is the second target entity of C_i, and C'_i3 is the target relationship between C'_i1 and C'_i2.

Specifically, step S200 further includes the following steps:

S201, obtaining an intermediate triplet C'_ix = {C^1_ix, C^2_ix, C^3_ix} corresponding to C_ix, where C^1_ix is the first intermediate entity of C_ix, C^2_ix is the second intermediate entity of C_ix, and C^3_ix is the intermediate relationship between C^1_ix and C^2_ix; any method in the art for obtaining triplets falls within the protection scope of the present invention and is not described here again.

S203, obtaining, according to all C'_ix, a first data list G^1_i = {C^1_i1, …, C^1_ix, …, C^1_ip} corresponding to C_i, a second data list G^2_i = {C^2_i1, …, C^2_ix, …, C^2_ip} corresponding to C_i, and a third data list G^3_i = {C^3_i1, …, C^3_ix, …, C^3_ip} corresponding to C_i.

S205, obtaining C'_i according to G^1_i, G^2_i and G^3_i.

Further, the step S205 further includes the steps of:

S2051, processing G^1_i to obtain a first designated list Q^1_i = {Q^1_i1, …, Q^1_iα, …, Q^1_iβ} corresponding to G^1_i, where Q^1_iα is the α-th first designated entity corresponding to G^1_i, α = 1, …, β, and β is the number of first designated entities corresponding to G^1_i. It can be understood that the first designated entities are the first intermediate entities remaining after G^1_i is de-duplicated; any de-duplication method in the art falls within the protection scope of the present invention and is not described here again.

S2053, traversing Q^1_i and taking the first designated entity with the largest count in Q^1_i as the third designated entity. It can be understood that, while Q^1_i is being obtained, the number of occurrences of each Q^1_iα can be counted, any counting method in the art falling within the protection scope of the present invention, so that the third designated entity can be determined quickly.

S2055, according to a preset word list, when the main word corresponding to Q^1_iα is consistent with the main word corresponding to the third designated entity, replacing Q^1_iα with the third designated entity. It can be understood that the main word corresponding to Q^1_iα and the main word corresponding to the third designated entity are each obtained from a preset entity configuration table, and when the two main words are consistent, Q^1_iα is replaced with the third designated entity, so that the entity that represents the target event can be determined accurately and its meaning identified accurately. Preferably, the preset word list is a list of synonyms and near-synonyms, where the main word is the principal word that characterizes a class of synonyms and near-synonyms.

S2057, obtaining the number ratio K^1_i corresponding to the third designated entity, where K^1_i satisfies the following condition:

K^1_i = K^1_i0 / p, where K^1_i0 is the total number of third designated entities.

S2059, when K^1_i0 is not less than a preset entity number threshold, determining that the third designated entity is C'_i1.

Further, C'_i2 is obtained in the same manner as C'_i1.
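Steps S2051 to S2059 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the `main_word` dictionary is a hypothetical stand-in for the preset word list or entity configuration table, entities missing from it are treated as their own main word, and the function returns the ratio K^1_i alongside the chosen entity only so that both quantities are visible.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def select_target_entity(
    entities: List[str],         # G^1_i: one intermediate entity per first text
    main_word: Dict[str, str],   # hypothetical table: entity -> main word
    count_threshold: int,        # preset entity number threshold
) -> Tuple[Optional[str], float]:
    if not entities:
        return None, 0.0
    # S2051 / S2053: de-duplicate with counts and take the most frequent entity.
    counts = Counter(entities)
    third_entity = counts.most_common(1)[0][0]
    third_main = main_word.get(third_entity, third_entity)
    # S2055: entities sharing the main word are replaced by the third designated entity.
    merged = [
        third_entity if main_word.get(e, e) == third_main else e
        for e in entities
    ]
    # S2057: K^1_i0 is the count after merging; K^1_i = K^1_i0 / p.
    k_i0 = merged.count(third_entity)
    ratio = k_i0 / len(entities)
    # S2059: accept the entity when K^1_i0 reaches the threshold.
    chosen = third_entity if k_i0 >= count_threshold else None
    return chosen, ratio
```

For the list ["NYC", "New York", "NYC", "Boston"] with both "NYC" and "New York" mapped to the same main word, the merged count is 3 out of p = 4, so a threshold of 3 selects "NYC" and a threshold of 4 selects nothing.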

In a specific embodiment, the method further comprises the steps of:

S1, obtaining a first tag list C^0_ix = {C^01_ix, …, C^0t_ix, …, C^0k_ix} corresponding to each C_ix, where C^0t_ix is the t-th first tag corresponding to C_ix, and a first tag is a tag of any first text.

S3, obtaining the main word corresponding to C^0t_ix according to the preset word list.

S5, obtaining, according to all C^0_ix, an intermediate tag list E_i = {E_i1, …, E_iv, …, E_iz} corresponding to C_i, where E_iv is the v-th intermediate tag corresponding to C_i, v = 1, …, z, and z is the total number of intermediate tags corresponding to C_i; the intermediate tags are the first tags remaining after the first tags in all C^0_ix are de-duplicated; any de-duplication method in the art falls within the protection scope of the present invention and is not described here again.

S7, obtaining, according to C^0_ix and E_i, a similarity list F^0_i = {F^0_i1, …, F^0_ix, …, F^0_ip} corresponding to C_i, where F^0_ix is the similarity corresponding to C^0_ix and satisfies the following condition:

F^0_ix = P_0 / z, where P_0 is the number of first tags in C^0_ix that satisfy C^0t_ix = E_iv.

S9, traversing F^0_i, and when F^0_ix is the maximum similarity in F^0_i, determining that the C^3_ix corresponding to F^0_ix is C'_i3.

In the above, the relationship between the two determined entities is established through the tags, so that the relationships between entities in different first texts can be unified, which makes it easier to determine subsequently whether a symbol in a text represents a relationship between entities.
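Steps S1 to S9 can be sketched as below. Several points are assumptions made for illustration: the condition in S7 is read as the ratio P_0 / z, only distinct tags of a text are counted toward P_0, tags missing from the hypothetical `main_word` table map to themselves, and ties for the maximum similarity are broken in favour of the earliest text.

```python
from typing import Dict, List, Tuple


def select_target_relation(
    tag_lists: List[List[str]],   # C^0_ix for x = 1..p
    relations: List[str],         # C^3_ix for x = 1..p
    main_word: Dict[str, str],    # hypothetical word list: tag -> main word
) -> Tuple[str, List[float]]:
    if not tag_lists or len(tag_lists) != len(relations):
        raise ValueError("need one tag list and one relation per first text")
    # S3: normalise each tag to its main word.
    normalised = [[main_word.get(t, t) for t in tags] for tags in tag_lists]
    # S5: intermediate tag list E_i, the de-duplicated tags (first-seen order).
    intermediate = list(dict.fromkeys(t for tags in normalised for t in tags))
    z = len(intermediate)
    e_set = set(intermediate)
    # S7: F^0_ix read as P_0 / z, with P_0 the distinct tags of the text found in E_i.
    sims = [len(set(tags) & e_set) / z if z else 0.0 for tags in normalised]
    # S9: the relation of the text with the maximum similarity becomes C'_i3.
    best = max(range(len(sims)), key=sims.__getitem__)
    return relations[best], sims
```

Because E_i is built from the same tags, a text's similarity under this reading grows with the number of distinct tags it carries, so the relation comes from the most richly tagged first text.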

S300, obtaining, according to D_iy, a key triplet H_iy = {H^1_iy, H^2_iy, H^3_iy} corresponding to D_iy, where H^1_iy is the first key entity of D_iy, H^2_iy is the second key entity of D_iy, and H^3_iy is the key relationship between H^1_iy and H^2_iy; any method in the art for obtaining triplets falls within the protection scope of the present invention and is not described here again.

S400, when H^g_iy is null, obtaining a similarity F_iy corresponding to D_iy, where H^g_iy is any one of H^1_iy, H^2_iy and H^3_iy.

Specifically, F_iy satisfies the following condition:

F_iy = F^1_iy × W_1 + F^2_iy × W_2 + F^0_iy, where F^1_iy and F^2_iy are the similarities of the key components, a key component being a component of H_iy other than H^g_iy, W_1 is the weight corresponding to F^1_iy, W_2 is the weight corresponding to F^2_iy, and F^0_iy is the target similarity of D_iy.

Further, F^1_iy and F^2_iy may be obtained by any method in the art for obtaining word similarity; for example, F^1_iy satisfies the following condition:

(formula not reproduced in this text)

where NK^iy_γ is the γ-th value in the word vector of the key component, MK'_γ is the γ-th value in the word vector of the component of C'_i that corresponds to that key component, γ = 1, …, φ, and φ is the dimension of the word vector.

Further, F^2_iy is obtained in the same manner as F^1_iy, which is not described here again.
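The formula for F^1_iy is not available in this text. Since the description allows any word similarity method, the sketch below uses cosine similarity over the φ-dimensional word vectors as one common choice; this is an assumption for illustration and should not be read as the formula of the patent. The function name and the zero-vector convention are likewise hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(nk: Sequence[float], mk: Sequence[float]) -> float:
    """Similarity between the word vector of a key component (nk) and the
    word vector of the corresponding target component (mk)."""
    if len(nk) != len(mk):
        raise ValueError("word vectors must share the same dimension")
    dot = sum(a * b for a, b in zip(nk, mk))
    norm = math.sqrt(sum(a * a for a in nk)) * math.sqrt(sum(b * b for b in mk))
    # Assumption: a zero vector is given similarity 0.0 instead of raising.
    return dot / norm if norm else 0.0
```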

Preferably, W_1 = W_2, which avoids inaccuracy in the similarity caused by the positions of the two entities being reversed.

Further, F^0_iy satisfies the following condition:

F^0_iy = P'_y / P_y, where P'_y is the number of second tags in the second tag list corresponding to D_iy that are also intermediate tags in E_i, and P_y is the total number of second tags in the second tag list corresponding to D_iy.

As described above, the meaning of a symbol can be determined from the tag similarity together with the similarity of the other elements of the triplet, so that the events that have occurred can be learned accurately from the content of network texts.
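The two conditions above, F^0_iy = P'_y / P_y and F_iy = F^1_iy × W_1 + F^2_iy × W_2 + F^0_iy, can be worked through in a short sketch. The default weights of 0.5 are an assumption chosen only to satisfy the preferred W_1 = W_2; the description does not fix their value, and returning 0.0 for an empty tag list is likewise an assumed convention.

```python
from typing import Iterable, Set


def target_similarity(second_tags: Iterable[str], intermediate_tags: Set[str]) -> float:
    """F^0_iy = P'_y / P_y for one second text D_iy."""
    tags = list(second_tags)
    if not tags:
        return 0.0  # assumption: no tags means no tag evidence
    matched = sum(1 for t in tags if t in intermediate_tags)  # P'_y
    return matched / len(tags)                                # P'_y / P_y


def overall_similarity(f1: float, f2: float, f0: float,
                       w1: float = 0.5, w2: float = 0.5) -> float:
    """F_iy = F^1_iy * W_1 + F^2_iy * W_2 + F^0_iy."""
    return f1 * w1 + f2 * w2 + f0
```

For example, with two of four second tags found in E_i, F^0_iy is 0.5; with F^1_iy = 0.8 and F^2_iy = 0.6 and both weights at 0.5, F_iy is 0.4 + 0.3 + 0.5 = 1.2. Note that F_iy can exceed 1 under this sum, so the preset threshold has to be chosen on the same scale.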

S500, when F_iy is not less than the preset similarity threshold, determining that H^g_iy = C'_ig, where C'_ig is any one of C'_i1, C'_i2 and C'_i3.

Specifically, C'_ig and H^g_iy have the same element type in their respective triplets; for example, when H^g_iy is the first key entity, C'_ig is the first target entity.
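The decision made in steps S400 to S600 can be sketched as follows. Representing a null component as `None`, the `ABNORMAL` marker string, and passing the similarity in as a precomputed number are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence

ABNORMAL = "<abnormal symbol>"  # hypothetical marker for step S600


def resolve_key_triplet(
    key: Sequence[Optional[str]],   # H_iy: (entity 1, entity 2, relation), None = null
    target: Sequence[str],          # C'_i: components in the same positions
    similarity: float,              # F_iy, computed elsewhere
    threshold: float,               # preset similarity threshold
) -> List[str]:
    resolved: List[str] = []
    for g, component in enumerate(key):
        if component is not None:
            resolved.append(component)   # component is known, keep it
        elif similarity >= threshold:
            resolved.append(target[g])   # S500: same position in the target triplet
        else:
            resolved.append(ABNORMAL)    # S600: mark as an abnormal symbol
    return resolved
```

Indexing `target` with the same position `g` reflects the requirement that C'_ig and H^g_iy have the same element type in their triplets.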

This embodiment provides a data processing system for recognizing symbols, comprising a database, a processor, and a memory storing a computer program, wherein the database comprises a target text set of events, and the computer program, when executed by the processor, implements the following steps: acquiring a first text list and a second text list corresponding to any event; acquiring, from each first text, a target triple corresponding to the first text list; obtaining, from any second text, a key triple corresponding to that second text; when any key component of the key triple is null, obtaining the similarity corresponding to the second text; when that similarity is not less than a preset similarity threshold, determining that the key component is the corresponding component of the target triple; and when that similarity is less than the preset similarity threshold, marking the key component as an abnormal symbol. In this way the meaning represented by a symbol in a text can be determined, and the events that have occurred can be learned accurately from the content of network texts.

Although the present invention has been described with reference to a preferred embodiment, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A data processing system for recognizing symbols, the system comprising: a database, a processor, and a memory storing a computer program, wherein the database comprises a target text set of events A = {A_1, …, A_i, …, A_n}, A_i is the target text list corresponding to the i-th event, i = 1, …, n, and n is the number of events, and when the computer program is executed by the processor, the following steps are implemented:

S100, obtaining a first text list C_i = {C_i1, …, C_ix, …, C_ip} corresponding to A_i, where C_ix is the x-th first text of the i-th target event, x = 1, …, p, and p is the number of first texts of the i-th target event, and obtaining a second text list D_i = {D_i1, …, D_iy, …, D_iq} corresponding to A_i, where D_iy is the y-th second text of the i-th target event, y = 1, …, q, and q is the number of second texts of the i-th target event, wherein in step S100 the first texts and the second texts are obtained through the following steps:

S101, obtaining A_i = {A_i1, …, A_ij, …, A_im_i} from the database, where A_ij is the j-th target text corresponding to the i-th event, j = 1, …, m_i, and m_i is the number of target texts corresponding to the i-th event;

S103, performing word segmentation on A_ij to obtain a target word string B_ij = {B^1_ij, …, B^r_ij, …, B^Sj_ij} corresponding to A_ij, where B^r_ij is the r-th target word corresponding to A_ij, r = 1, …, Sj, and Sj is the number of target words corresponding to A_ij;

S105, when B^r_ij is not a symbol, determining A_ij to be a first text;

S107, when B^r_ij is a symbol, determining A_ij to be a second text;

S200, obtaining, according to each C_ix, a target triplet C'_i = {C'_i1, C'_i2, C'_i3} corresponding to C_i, where C'_i1 is the first target entity of C_i, C'_i2 is the second target entity of C_i, and C'_i3 is the target relationship between C'_i1 and C'_i2, wherein step S200 further comprises the following steps:

S201, obtaining an intermediate triplet C'_ix = {C^1_ix, C^2_ix, C^3_ix} corresponding to C_ix, where C^1_ix is the first intermediate entity of C_ix, C^2_ix is the second intermediate entity of C_ix, and C^3_ix is the intermediate relationship between C^1_ix and C^2_ix;

S203, obtaining, according to all C'_ix, a first data list G^1_i = {C^1_i1, …, C^1_ix, …, C^1_ip} corresponding to C_i, a second data list G^2_i = {C^2_i1, …, C^2_ix, …, C^2_ip} corresponding to C_i, and a third data list G^3_i = {C^3_i1, …, C^3_ix, …, C^3_ip} corresponding to C_i;

S205, obtaining C'_i according to G^1_i, G^2_i and G^3_i;

S300, obtaining, according to D_iy, a key triplet H_iy = {H^1_iy, H^2_iy, H^3_iy} corresponding to D_iy, where H^1_iy is the first key entity of D_iy, H^2_iy is the second key entity of D_iy, and H^3_iy is the key relationship between H^1_iy and H^2_iy;

S500, when F_iy is not less than a preset similarity threshold, determining that H^g_iy = C'_ig, where C'_ig is any one of C'_i1, C'_i2 and C'_i3;

2. The data processing system for recognizing symbols according to claim 1, wherein step S205 further comprises the following steps:

S2051, processing G^1_i to obtain a first designated list Q^1_i = {Q^1_i1, …, Q^1_iα, …, Q^1_iβ} corresponding to G^1_i, where Q^1_iα is the α-th first designated entity corresponding to G^1_i, α = 1, …, β, and β is the number of first designated entities corresponding to G^1_i, the first designated entities being the first intermediate entities remaining after G^1_i is de-duplicated;

S2053, traversing Q^1_i and taking the first designated entity with the largest count in Q^1_i as the third designated entity;

S2055, according to a preset entity configuration table, when the main word corresponding to Q^1_iα is consistent with the main word corresponding to the third designated entity, replacing Q^1_iα with the third designated entity;

K^1_i = K^1_i0 / p, where K^1_i0 is the total number of third designated entities;

S2059, when K^1_i0 is not less than a preset entity number threshold, determining that the third designated entity is C'_i1.

3. The data processing system for recognizing symbols according to claim 1, wherein C'_i2 is obtained in the same manner as C'_i1.

4. The data processing system for recognizing symbols according to claim 1, wherein F_iy satisfies the following condition:

F_iy = F^1_iy × W_1 + F^2_iy × W_2 + F^0_iy, where F^1_iy and F^2_iy are the similarities of the components of H_iy other than H^g_iy, W_1 is the weight corresponding to F^1_iy, W_2 is the weight corresponding to F^2_iy, and F^0_iy is the target similarity of D_iy.

5. The data processing system for recognizing symbols according to claim 4, wherein W_1 = W_2.

6. The data processing system for recognizing symbols according to claim 1, wherein C'_ig and H^g_iy have the same element type in their respective triplets.

7. The data processing system for recognizing symbols according to claim 1, wherein the target text is text from unofficial media.