CN112329458A - New organization descriptor recognition method and device, electronic device and storage medium - Google Patents

New organization descriptor recognition method and device, electronic device and storage medium

Info

Publication number
CN112329458A
Authority
CN
China
Prior art keywords
word
binary
spliced
preset
organization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010435003.3A
Other languages
Chinese (zh)
Other versions
CN112329458B (en
Inventor
彭涛
杜晶
刘孔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Mingyi Technology Co ltd
Original Assignee
Beijing Mingyi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Mingyi Technology Co ltd
Priority to CN202010435003.3A
Publication of CN112329458A
Application granted
Publication of CN112329458B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/18 Legal services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Technology Law (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure provides a new organization descriptor recognition method and apparatus, an electronic device, and a storage medium. One embodiment of the method comprises: acquiring a recent organization description related historical text set; performing word segmentation processing on each recent organization description related historical text in the recent organization description related historical text set to obtain a corresponding word segmentation sequence, and generating a target word segmentation sequence set by using each word segmentation sequence obtained after the word segmentation processing; generating a binary concatenation word library by using a binary concatenation word formed by two adjacent word segmentations in a target word segmentation sequence set; for each binary concatenated word in the binary concatenated word repository, a recognition operation is performed to determine whether the binary concatenated word is a new organizational descriptor. The embodiment realizes automatic extraction of new organization descriptors in the recent organization description related historical text set.

Description

New organization descriptor recognition method and device, electronic device and storage medium
Technical Field
The disclosure relates to the technical field of computers, in particular to a new organization descriptor identification method and device, an electronic device and a storage medium.
Background
At present, new organization descriptors in recently generated texts are mostly extracted manually. This requires considerable manpower and time, so newly emerging organizations and their activities or behaviors cannot be discovered and handled promptly, which poses hidden dangers to society. In addition, most texts are written in natural language, and their wording is highly colloquial and irregular, which makes manual extraction difficult; because manual extraction of new organization descriptors depends on personal experience, the learning cost is also high.
Disclosure of Invention
The disclosure provides a new organization descriptor recognition method and device, an electronic device and a storage medium.
In a first aspect, the present disclosure provides a new organizational descriptor recognition method, including: acquiring a recent organization description related historical text set, wherein the recent organization description related historical text set is a historical text set which is generated within a latest preset organization discovery duration and is related to a description organization; performing word segmentation processing on each recent organization description related historical text in the recent organization description related historical text set to obtain a corresponding word segmentation sequence, and generating a target word segmentation sequence set by using each word segmentation sequence obtained after the word segmentation processing; generating a binary concatenation word library by using a binary concatenation word formed by two adjacent word segmentations in a target word segmentation sequence in the target word segmentation sequence set; for each binary concatenation word in the binary concatenation word library, executing the following identification operation: calculating the word frequency, the degree of freedom and the degree of solidity of the binary spliced word based on the target word segmentation sequence set, and determining the binary spliced word as a new organization descriptor in response to determining that the binary spliced word meets each condition in a preset new word discovery condition set, wherein the preset new word discovery condition set comprises at least one of the following conditions: the word frequency of the binary spliced word is larger than a preset word frequency threshold value, the degree of solidification of the binary spliced word is larger than a preset degree of solidification threshold value, and the degree of freedom of the binary spliced word is larger than a preset degree of freedom threshold value.
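The recognition decision in the first aspect can be sketched as follows. This is a minimal illustration, not the claimed implementation: the class name `DiscoveryConditions`, the function name `is_new_descriptor`, and the convention that `None` marks a condition absent from the preset set are all assumptions made for the example. It only reflects what the text states: the preset condition set contains at least one of the three threshold conditions, and a binary spliced word is accepted only if it meets every condition in that set.

```python
from typing import NamedTuple, Optional


class DiscoveryConditions(NamedTuple):
    """Hypothetical container for the preset new word discovery condition set.

    A threshold of None means that condition is not part of the preset set.
    """
    min_word_frequency: Optional[float] = None
    min_solidity: Optional[float] = None
    min_freedom: Optional[float] = None


def is_new_descriptor(word_frequency: float, solidity: float, freedom: float,
                      conditions: DiscoveryConditions) -> bool:
    """Return True if the word meets every condition present in the set."""
    pairs = [
        (word_frequency, conditions.min_word_frequency),
        (solidity, conditions.min_solidity),
        (freedom, conditions.min_freedom),
    ]
    active = [(value, threshold) for value, threshold in pairs
              if threshold is not None]
    if not active:
        # The text requires at least one condition in the preset set.
        raise ValueError("the condition set must contain at least one condition")
    # Each condition is a strict "greater than threshold" comparison.
    return all(value > threshold for value, threshold in active)
```

With only the word frequency and solidity conditions configured, a word with a low degree of freedom can still be accepted; adding the freedom condition makes the check stricter.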
In some optional embodiments, the performing word segmentation processing on each recent organization description related historical text in the recent organization description related historical text set to obtain a corresponding word segmentation sequence includes: performing word segmentation processing on each recent organization description related historical text in the recent organization description related historical text set based on a preset word segmentation dictionary to obtain a corresponding word segmentation sequence; and the above method further comprises: adding each binary concatenation word determined as a new organization descriptor in the binary concatenation word library to the preset word segmentation dictionary.
In some optional embodiments, the preset organization discovery duration is predetermined by the following duration determination step: for each candidate duration in a preset candidate duration set, performing the following recognition accuracy determination operation: acquiring a historical text set that was generated within the candidate duration and is related to describing organizations, together with a corresponding labeled new organization descriptor set; performing word segmentation processing on each historical text in the acquired historical text set to obtain a corresponding word segmentation sequence, and generating a word segmentation sequence set corresponding to the candidate duration by using each word segmentation sequence obtained after the word segmentation processing; generating a binary spliced word library corresponding to the candidate duration by using binary spliced words each formed by two adjacent participles in the word segmentation sequence set corresponding to the candidate duration; for each binary spliced word in the binary spliced word library corresponding to the candidate duration, calculating the word frequency, the degree of freedom and the degree of solidity of the binary spliced word based on the word segmentation sequence set corresponding to the candidate duration, and determining the binary spliced word to be a correctly recognized word in response to determining either that the binary spliced word satisfies every condition in the preset new word discovery condition set and belongs to the labeled new organization descriptor set, or that the binary spliced word fails at least one condition in the preset new word discovery condition set and does not belong to the labeled new organization descriptor set; determining the ratio of the number of correctly recognized words in the binary spliced word library corresponding to the candidate duration to the total number of binary spliced words in that library as the recognition accuracy corresponding to the candidate duration; and determining the candidate duration with the highest recognition accuracy in the preset candidate duration set as the preset organization discovery duration.
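The duration determination step amounts to computing a recognition accuracy per candidate duration and keeping the best one. The sketch below assumes the per-word recognition decisions have already been made; the function names and the dictionary layout of `results` are hypothetical and chosen only for illustration.

```python
from typing import Dict, Hashable, Set, Tuple


def recognition_accuracy(decisions: Dict[str, bool], labeled: Set[str]) -> float:
    """Share of binary spliced words whose decision agrees with the labels.

    decisions maps every word in the library to True if it met every
    condition in the preset new word discovery condition set.
    """
    if not decisions:
        return 0.0
    correct = sum(1 for word, accepted in decisions.items()
                  if accepted == (word in labeled))
    return correct / len(decisions)


def choose_discovery_duration(
        results: Dict[Hashable, Tuple[Dict[str, bool], Set[str]]]) -> Hashable:
    """Pick the candidate duration with the highest recognition accuracy.

    results maps each candidate duration to (decisions, labeled descriptors).
    """
    return max(results, key=lambda d: recognition_accuracy(*results[d]))
```

A word counts as correct both when it is accepted and labeled, and when it is rejected and unlabeled, which matches the two cases listed in the text.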
In some optional embodiments, calculating, for each binary spliced word in the binary spliced word library, the word frequency, the degree of freedom and the degree of solidity of the binary spliced word based on the target word segmentation sequence set includes: for each binary spliced word x in the binary spliced word library, formed by splicing participle x_1 and participle x_2, executing the following calculation operation: counting the word frequency P(x) of the binary spliced word x in the target word segmentation sequence set, the word frequency P(x_1) of participle x_1 in the target word segmentation sequence set, and the word frequency P(x_2) of participle x_2 in the target word segmentation sequence set; calculating the degree of solidity Aglomeration(x) of the binary spliced word x according to the following formula:
[Formula image RE-GDA0002737402510000021, defining Aglomeration(x), is not reproduced in this text.]
generating a preceding adjacent word set Pre_x corresponding to the binary spliced word x by using each participle that is located before and adjacent to the binary spliced word x in each word segmentation sequence of the target word segmentation sequence set; counting the word frequency P(y), in the target word segmentation sequence set, of each word y in the preceding adjacent word set Pre_x; generating a following adjacent word set Post_x corresponding to the binary spliced word x by using each participle that is located after and adjacent to the binary spliced word x in each word segmentation sequence of the target word segmentation sequence set; counting the word frequency P(z), in the target word segmentation sequence set, of each word z in the following adjacent word set Post_x; and calculating the degree of freedom Free(x) of the binary spliced word x according to the following formulas:
[Formula images RE-GDA0002737402510000022 and RE-GDA0002737402510000023 are not reproduced in this text; from context they define H(Pre_x) and H(Post_x).]
Free(x) = min(H(Pre_x), H(Post_x))
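The following sketch shows one way these statistics could be computed. The formulas for the degree of solidity and for H are given in the source only as images that are not available here, so the code makes loudly labeled assumptions: word frequency is taken as relative frequency, the degree of solidity is assumed to be the ratio P(x) / (P(x_1) · P(x_2)), and H is assumed to be an entropy-style sum -Σ P(y) log P(y) over the adjacent word set, with P(y) measured over the whole target set as the text describes. Many new word discovery implementations instead use the neighbor-conditional distribution; the patent's actual formulas may differ from both.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def bigram_statistics(
        sequences: List[List[str]]) -> Dict[Tuple[str, str], Tuple[float, float, float]]:
    """Return {(x1, x2): (word frequency, solidity, freedom)}.

    All three formulas are ASSUMPTIONS; see the lead-in paragraph.
    """
    unigrams = Counter(tok for seq in sequences for tok in seq)
    bigrams = Counter(pair for seq in sequences for pair in zip(seq, seq[1:]))
    n_unigrams = sum(unigrams.values())
    n_bigrams = sum(bigrams.values())

    # Pre_x / Post_x: participles immediately before / after each bigram.
    pre = {bg: set() for bg in bigrams}
    post = {bg: set() for bg in bigrams}
    for seq in sequences:
        for i in range(len(seq) - 1):
            bg = (seq[i], seq[i + 1])
            if i > 0:
                pre[bg].add(seq[i - 1])
            if i + 2 < len(seq):
                post[bg].add(seq[i + 2])

    def p(token: str) -> float:
        # Assumed: relative frequency of a participle in the target set.
        return unigrams[token] / n_unigrams

    def entropy(neighbors) -> float:
        # Assumed entropy-style sum; an empty neighbor set gives 0.
        return -sum(p(y) * math.log(p(y)) for y in neighbors)

    stats = {}
    for bg, count in bigrams.items():
        frequency = count / n_bigrams
        solidity = frequency / (p(bg[0]) * p(bg[1]))  # assumed ratio form
        freedom = min(entropy(pre[bg]), entropy(post[bg]))
        stats[bg] = (frequency, solidity, freedom)
    return stats
```

Taking the minimum of the two entropies means a binary spliced word needs varied context on both sides to score a high degree of freedom; a word that only ever appears at the start or end of a sequence scores zero under these assumptions.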
In a second aspect, the present disclosure provides a new organization descriptor recognition apparatus, the apparatus comprising: an acquisition unit configured to acquire a recent organization description related historical text set, wherein the recent organization description related historical text set is a historical text set which is generated within a latest preset organization discovery duration and is related to a description organization; a first generation unit configured to perform word segmentation processing on each recent organization description related historical text in the recent organization description related historical text set to obtain a corresponding word segmentation sequence, and generate a target word segmentation sequence set by using each word segmentation sequence obtained after the word segmentation processing; a second generation unit configured to generate a binary concatenated word library by using a binary concatenated word formed by two adjacent participles in a target participle sequence in the target participle sequence set; and a recognition unit configured to execute the following recognition operation on each binary concatenated word in the binary concatenated word library: calculating the word frequency, the degree of freedom and the degree of solidity of the binary concatenated word based on the target word segmentation sequence set, and determining the binary concatenated word as a new organization descriptor in response to determining that the binary concatenated word meets each condition in a preset new word discovery condition set, wherein the preset new word discovery condition set comprises at least one of the following conditions: the word frequency of the binary concatenated word is larger than a preset word frequency threshold value, the degree of solidity of the binary concatenated word is larger than a preset degree of solidity threshold value, and the degree of freedom of the binary concatenated word is larger than a preset degree of freedom threshold value.
In some optional embodiments, the performing word segmentation processing on each recent organization description related historical text in the recent organization description related historical text set to obtain a corresponding word segmentation sequence includes: performing word segmentation processing on each recent organization description related historical text in the recent organization description related historical text set based on a preset word segmentation dictionary to obtain a corresponding word segmentation sequence; and the above apparatus further comprises: an adding unit configured to add each binary spliced word determined as a new organization descriptor in the binary spliced word library to the preset word segmentation dictionary.
In some optional embodiments, the preset organization discovery duration is predetermined by the following duration determination step: for each candidate duration in a preset candidate duration set, performing the following recognition accuracy determination operation: acquiring a historical text set that was generated within the candidate duration and is related to describing organizations, together with a corresponding labeled new organization descriptor set; performing word segmentation processing on each historical text in the acquired historical text set to obtain a corresponding word segmentation sequence, and generating a word segmentation sequence set corresponding to the candidate duration by using each word segmentation sequence obtained after the word segmentation processing; generating a binary spliced word library corresponding to the candidate duration by using binary spliced words each formed by two adjacent participles in the word segmentation sequence set corresponding to the candidate duration; for each binary spliced word in the binary spliced word library corresponding to the candidate duration, calculating the word frequency, the degree of freedom and the degree of solidity of the binary spliced word based on the word segmentation sequence set corresponding to the candidate duration, and determining the binary spliced word to be a correctly recognized word in response to determining either that the binary spliced word satisfies every condition in the preset new word discovery condition set and belongs to the labeled new organization descriptor set, or that the binary spliced word fails at least one condition in the preset new word discovery condition set and does not belong to the labeled new organization descriptor set; determining the ratio of the number of correctly recognized words in the binary spliced word library corresponding to the candidate duration to the total number of binary spliced words in that library as the recognition accuracy corresponding to the candidate duration; and determining the candidate duration with the highest recognition accuracy in the preset candidate duration set as the preset organization discovery duration.
In some optional embodiments, calculating, for each binary spliced word in the binary spliced word library, the word frequency, the degree of freedom and the degree of solidity of the binary spliced word based on the target word segmentation sequence set includes: for each binary spliced word x in the binary spliced word library, formed by splicing participle x_1 and participle x_2, executing the following calculation operation: counting the word frequency P(x) of the binary spliced word x in the target word segmentation sequence set, the word frequency P(x_1) of participle x_1 in the target word segmentation sequence set, and the word frequency P(x_2) of participle x_2 in the target word segmentation sequence set; calculating the degree of solidity Aglomeration(x) of the binary spliced word x according to the following formula:
[Formula image RE-GDA0002737402510000041, defining Aglomeration(x), is not reproduced in this text.]
generating a preceding adjacent word set Pre_x corresponding to the binary spliced word x by using each participle that is located before and adjacent to the binary spliced word x in each word segmentation sequence of the target word segmentation sequence set; counting the word frequency P(y), in the target word segmentation sequence set, of each word y in the preceding adjacent word set Pre_x; generating a following adjacent word set Post_x corresponding to the binary spliced word x by using each participle that is located after and adjacent to the binary spliced word x in each word segmentation sequence of the target word segmentation sequence set; counting the word frequency P(z), in the target word segmentation sequence set, of each word z in the following adjacent word set Post_x; and calculating the degree of freedom Free(x) of the binary spliced word x according to the following formulas:
[Formula images RE-GDA0002737402510000042 and RE-GDA0002737402510000043 are not reproduced in this text; from context they define H(Pre_x) and H(Post_x).]
Free(x) = min(H(Pre_x), H(Post_x))
In a third aspect, the present disclosure provides an electronic device, comprising: one or more processors; and a storage device storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method described in any implementation of the first aspect.
In a fourth aspect, the present disclosure provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by one or more processors, implements the method as described in any of the implementations of the first aspect.
In order to identify new organizational descriptors from recently generated text, applicants have discovered through research that if a binary concatenation word occurs frequently in recent organizational description-related historical text, it is highly likely that the binary concatenation word is a new word that is used to describe the organization. Based on the above findings, the new organization descriptor recognition method and apparatus provided by the present disclosure first obtain a recent organization description related history text set related to a description organization generated within a last preset organization discovery duration. And then, word segmentation processing is carried out on each recent organization description related historical text in the recent organization description related historical text set to obtain a corresponding word segmentation sequence, and a target word segmentation sequence set is generated by using each word segmentation sequence obtained after the word segmentation processing. And then, generating a binary concatenation word library by using a binary concatenation word formed by two adjacent word segmentations in the target word segmentation sequence set. Then, for each binary spliced word in the binary spliced word library, calculating the word frequency, the degree of freedom and the degree of solidification of the binary spliced word based on the target word segmentation sequence set, and determining the binary spliced word as a new organization descriptor in response to determining that the binary spliced word meets each condition in a preset new word discovery condition set. According to the method for identifying the new organization descriptors, the whole process does not need manual operation, the labor cost and the time cost for finding the new organization descriptors are reduced, and the method can quickly identify the new organization descriptors from a large amount of recently generated texts.
Drawings
Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;
FIG. 2 is a flow diagram of one embodiment of a new organizational descriptor recognition method according to the present disclosure;
FIG. 3 is a flow chart of one embodiment of a duration determination step according to the present disclosure;
FIG. 4 is a flow diagram of yet another embodiment of a new organizational descriptor recognition method in accordance with the present disclosure;
FIG. 5 is a schematic diagram of the structure of one embodiment of a new organization descriptor recognition apparatus according to the present disclosure;
FIG. 6 is a schematic block diagram of a computer system suitable for use in implementing the electronic device of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the new organization descriptor recognition method or apparatus of the present disclosure may be applied.
As shown in fig. 1, system architecture 100 may include terminal device 101, network 102, and server 103. Network 102 is the medium used to provide communication links between terminal devices 101 and server 103. Network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use terminal device 101 to interact with server 103 over network 102 to receive or send messages and the like. Various communication client applications, such as a text record application, a new organization descriptor recognition application, a web browser application, etc., may be installed on the terminal device 101.
The terminal device 101 may be hardware or software. When the terminal device 101 is hardware, it may be any of various electronic devices that have a display screen and support text input, including but not limited to smartphones, tablet computers, laptop computers, desktop computers, and the like. When the terminal device 101 is software, it may be installed in the electronic devices listed above, and may be implemented either as multiple pieces of software or software modules (e.g., to provide a new organization descriptor recognition service for text) or as a single piece of software or software module. No specific limitation is imposed here.
The server 103 may be a server that provides various services, such as a background server that provides a new organization descriptor recognition service for text sent by the terminal device 101. The background server may analyze the received text, and feed back a processing result (e.g., new organization descriptor) to the terminal device.
In some cases, the new organization descriptor recognition method provided by the present disclosure may be performed by both the terminal device 101 and the server 103, for example, the step of "obtaining a recent organization description related history text set" may be performed by the terminal device 101, and the remaining steps may be performed by the server 103. The present disclosure is not limited thereto. Accordingly, the new organization descriptor recognition means may be provided in the terminal device 101 and the server 103, respectively.
In some cases, the new organization descriptor recognition method provided by the present disclosure may be executed by the server 103, and accordingly, a new organization descriptor recognition apparatus may also be disposed in the server 103, and in this case, the system architecture 100 may not include the terminal device 101.
In some cases, the new organization descriptor recognition method provided by the present disclosure may be executed by the terminal device 101, and accordingly, the new organization descriptor recognition apparatus may also be disposed in the terminal device 101, and in this case, the system architecture 100 may not include the server 103.
The server 103 may be hardware or software. When the server 103 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or as a single server. When the server 103 is software, it may be implemented either as multiple pieces of software or software modules (for example, to provide a new organization descriptor recognition service) or as a single piece of software or software module. No specific limitation is imposed here.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a new organizational descriptor recognition method in accordance with the present disclosure is shown. The new organization descriptor recognition method comprises the following steps:
Step 201, obtaining a recent organization description related historical text set.
In this embodiment, the executing subject of the new organization descriptor recognition method (e.g., the server shown in fig. 1) may first obtain a recent organization description related history text set. Here, the recent organization description related history text set is a history text set related to the description organization generated within a last preset organization discovery duration.
Here, the preset organization discovery duration may be set in advance in various ways. For example, it may be a length of time that a technician presets and stores in the execution subject based on the computing performance parameters of the execution subject and the amount of organization description related text generated per unit duration historically. For example, the preset organization discovery duration may be 5 days, 150 hours, or the like. It can be understood that the longer the preset organization discovery duration, the larger the amount of data in the obtained recent organization description related historical text set and, accordingly, the longer it takes to identify the new organization descriptors in that set, which delays obtaining them. Conversely, if the preset organization discovery duration is too short, the obtained set may contain too little text data, so that no new organization descriptor is found or the descriptors found are not actual new organization descriptors. Setting the preset organization discovery duration therefore requires balancing the computation time required against the accuracy of the new organization descriptors determined.
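Selecting the texts that fall within the most recent preset organization discovery duration can be sketched as a simple time-window filter. The record layout (a creation timestamp paired with the text) and the function name `recent_texts` are assumptions for illustration; the source does not say how timestamps are stored.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def recent_texts(records: List[Tuple[datetime, str]], now: datetime,
                 discovery_duration: timedelta) -> List[str]:
    """Keep texts created within the last `discovery_duration` before `now`."""
    cutoff = now - discovery_duration
    return [text for created_at, text in records
            if cutoff <= created_at <= now]
```

For instance, a duration of `timedelta(days=5)` or `timedelta(hours=150)` corresponds to the two example values mentioned above.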
Here, the execution subject may obtain a recent organization description related history text set stored locally, or the execution subject may remotely obtain a recent organization description related history text set from another electronic device (for example, a terminal device shown in fig. 1) connected to the execution subject through a network.
It should be noted that the recent organization description related historical text set acquired here may be the original historical text set related to describing organizations that was generated within the most recent preset organization discovery duration; it may also be a text set obtained by preprocessing that original historical text set. By way of example, preprocessing may include, but is not limited to, removing invalid characters, full-width/half-width character conversion, and the like. The invalid characters may be, for example, modal particles, function words, or the like.
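A minimal preprocessing sketch is shown below. The full-width to half-width mapping follows the Unicode layout (U+FF01 to U+FF5E map to U+0021 to U+007E, and the ideographic space U+3000 maps to an ordinary space). The set of characters treated as invalid is purely hypothetical; the source does not list which characters are removed.

```python
def to_half_width(text: str) -> str:
    """Convert full-width ASCII variants and the ideographic space to half-width."""
    converted = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:            # ideographic space
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:  # full-width ASCII block
            code -= 0xFEE0
        converted.append(chr(code))
    return "".join(converted)


# Hypothetical examples of modal particles treated as invalid characters.
EXAMPLE_INVALID_CHARS = frozenset("啊吧呢")


def preprocess(text: str, invalid_chars=EXAMPLE_INVALID_CHARS) -> str:
    """Normalize character width, then drop characters listed as invalid."""
    normalized = to_half_width(text)
    return "".join(ch for ch in normalized if ch not in invalid_chars)
```

Character-level removal is a simplification; removing multi-character function words would need to happen after word segmentation rather than before it.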
Step 202, performing word segmentation processing on each recent organization description related historical text in the recent organization description related historical text set to obtain a corresponding word segmentation sequence, and generating a target word segmentation sequence set by using each word segmentation sequence obtained after the word segmentation processing.
In this embodiment, the executing body may perform word segmentation processing on each recent organization description related history text in the recent organization description related history text set acquired in step 201 to obtain a corresponding word segmentation sequence, and then may generate a target word segmentation sequence set from each word segmentation sequence obtained after the word segmentation processing.
It should be noted that how to segment text into words is prior art that has been widely researched and applied in this field, and is not described in detail here. For example, a word segmentation method based on string matching, a word segmentation method based on understanding, or a word segmentation method based on statistics may be employed. For example, segmenting the historical text "members of the public reported that a rental house in a certain residential community often has suspicious persons coming and going" may yield the word segmentation sequence "there are | members of the public | reporting | a certain | residential community | a certain | rental | house | often | has | suspicious | persons | coming and going".
Step 203, generating a binary concatenated word library by using the binary concatenated words formed by two adjacent participles in the target word segmentation sequences of the target word segmentation sequence set.
In this embodiment, the execution main body may generate a binary concatenated word library by using a binary concatenated word composed of two adjacent segmented words in a target segmented word sequence in the target segmented word sequence set.
For example, assume that the target word segmentation sequence set is { "rental | house | often | has | suspicious | persons | coming and going", "two groups of | unknown | identity | persons | in | parking lot | because of | parking | problem | fighting" }. The binary concatenated word library obtained through step 203 is then { "rental house", "house often", "often has", "has suspicious", "suspicious persons", "persons coming and going", "two groups of unknown", "unknown identity", "identity persons", "persons in", "in parking lot", "parking lot because of", "because of parking", "parking problem", "problem fighting" }. Note that binary concatenated words are formed only from participles that are adjacent within the same word segmentation sequence.
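Step 203 can be sketched in a few lines. The sketch assumes each word segmentation sequence is represented as a list of strings and each binary concatenated word as a pair of adjacent participles; the function name `build_bigram_library` is illustrative only.

```python
def build_bigram_library(sequences):
    """Collect every pair of adjacent participles across all sequences."""
    library = set()
    for seq in sequences:
        # zip(seq, seq[1:]) pairs each participle with its right neighbour,
        # so pairs never span two different sequences.
        for left, right in zip(seq, seq[1:]):
            library.add((left, right))
    return library
```

Keeping the pair as a tuple rather than a joined string preserves the boundary between the two participles, which is needed later when their individual frequencies are looked up.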
Step 204, executing an identification operation on each binary concatenated word in the binary concatenated word library.
In this embodiment, the execution subject may execute the recognition operation for each binary concatenated word in the binary concatenated word library generated in step 203. Specifically, the identifying operation may include sub-step 2041 and sub-step 2042.
Sub-step 2041, calculating the word frequency, the degree of freedom and the degree of solidification of the binary concatenated word based on the target word segmentation sequence set.
In this embodiment, the execution subject may adopt various implementations to calculate the word frequency, the degree of freedom and the degree of solidification of the binary concatenated word based on the target word segmentation sequence set.
The word frequency of the binary concatenated word represents how frequently the binary concatenated word occurs in the target word segmentation sequence set. The more frequently the binary concatenated word occurs in the target word segmentation sequence set, the higher the probability that it is a new organization descriptor.
In some optional implementation manners, calculating the word frequency of the binary concatenated word based on the target word segmentation sequence set may be to count a sum of occurrence times of the binary concatenated word in each target word segmentation sequence of the target word segmentation sequence set, and determine the sum of the occurrence times obtained through the counting as the word frequency of the binary concatenated word.
In some optional implementations, calculating the word frequency of the binary concatenated word based on the target word segmentation sequence set may also be performed as follows: firstly, counting the sum of the occurrence times of the binary concatenated word in each target word segmentation sequence of the target word segmentation sequence set, and then determining the word frequency of the binary concatenated word by the ratio obtained by dividing the counted sum of the occurrence times by the sum of the total occurrence times of the segmented words corresponding to the target word segmentation sequence set. Here, the sum of the total occurrence times of the participles corresponding to the target participle sequence set is the sum of the occurrence times of each participle in each target participle sequence in the target participle sequence set.
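The two word-frequency variants described above (raw count, or count divided by the total number of participle occurrences) can be sketched as follows. Function and parameter names are illustrative assumptions, and sequences are assumed to be lists of strings.

```python
from collections import Counter


def count_bigrams(sequences):
    """Count occurrences of each adjacent participle pair."""
    return Counter(pair for seq in sequences for pair in zip(seq, seq[1:]))


def bigram_frequency(bigram, sequences, relative=False):
    """Return the raw count n, or the ratio n / N when relative=True."""
    n = count_bigrams(sequences)[bigram]
    if not relative:
        return n
    # N: total number of participle occurrences over all sequences.
    total = sum(len(seq) for seq in sequences)
    return n / total if total else 0.0
```

For many bigrams it would be cheaper to build the counter once and reuse it; the sketch recomputes it per call only to stay short.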
The degree of solidification of the binary concatenated word is used for representing the degree of fixation or combination of two participles included in the binary concatenated word in a target participle sequence, and if the degree of fixation or combination of the binary concatenated word in a target participle sequence set is higher, the probability that the binary concatenated word is a new organization descriptor is higher.
Assume that the binary concatenated word library is X, that each binary concatenated word x in X is formed by concatenating a participle $x_1$ and a participle $x_2$, that is, $x = x_1x_2$, and that the word frequency of the binary concatenated word x in the target word segmentation sequence set is $P(x)$.
In some optional implementations, the degree of solidification Agglomeration(x) of the binary concatenated word x may be calculated based on the target word segmentation sequence set as follows:
First, the word frequency $P(x_1)$ of participle $x_1$ and the word frequency $P(x_2)$ of participle $x_2$ in the target word segmentation sequence set may be determined. It should be noted that $P(x_1)$ and $P(x_2)$ may be determined by the same method as that used above to determine the word frequency $P(x)$ of the binary concatenated word x in the target word segmentation sequence set.
Then, the degree of solidification Agglomeration(x) of the binary concatenated word x may be calculated according to the following formula:
$$\mathrm{Agglomeration}(x) = \frac{P(x)}{P(x_1)\,P(x_2)} \quad \text{(formula 1)}$$
Suppose that the numbers of occurrences of the binary concatenated word x, participle $x_1$ and participle $x_2$ in the target word segmentation sequences of the target word segmentation sequence set are $n$, $n_1$ and $n_2$ respectively, and that the sum of the total occurrence times of the participles corresponding to the target word segmentation sequence set is $N$, where $N$ is a positive integer. Then $P(x)$, $P(x_1)$ and $P(x_2)$ may be $n$, $n_1$ and $n_2$ respectively, or may be $\frac{n}{N}$, $\frac{n_1}{N}$ and $\frac{n_2}{N}$ respectively.
As can be seen from the above formula, when $P(x)$, $P(x_1)$ and $P(x_2)$ are $n$, $n_1$ and $n_2$ respectively, the degree of solidification Agglomeration(x) of the binary concatenated word x can be expressed as follows:
$$\mathrm{Agglomeration}(x) = \frac{n}{n_1 n_2} \quad \text{(formula 2)}$$
When $P(x)$, $P(x_1)$ and $P(x_2)$ are $\frac{n}{N}$, $\frac{n_1}{N}$ and $\frac{n_2}{N}$ respectively, the degree of solidification Agglomeration(x) of the binary concatenated word x can be expressed as follows:
$$\mathrm{Agglomeration}(x) = \frac{n/N}{(n_1/N)(n_2/N)} = \frac{nN}{n_1 n_2} \quad \text{(formula 3)}$$
As can be seen from formulas 2 and 3, the degree of solidification Agglomeration(x) of the binary concatenated word x is inversely proportional to the number of occurrences $n_1$ of participle $x_1$ and to the number of occurrences $n_2$ of participle $x_2$ in the target word segmentation sequence set, and directly proportional to the number of occurrences $n$ of the binary concatenated word x in the target word segmentation sequence set. Wherein:
The upper extreme of Agglomeration(x) is reached when $n_1$, $n_2$ and $n$ are equal. In that case, if the word frequency is calculated by the method shown in formula 2, Agglomeration(x) is $\frac{1}{n}$; accordingly, if the word frequency is calculated by the method shown in formula 3, Agglomeration(x) is $\frac{N}{n}$. In this case, participle $x_1$ appears in the target word segmentation sequence set only together with participle $x_2$, and participle $x_2$ appears only together with participle $x_1$; neither participle appears alone. This indicates that the probability that $x_1x_2$ is used in combination as a single word is high.
Conversely, the lower extreme of Agglomeration(x) is reached when $n$ is 1 and $n_1$ and/or $n_2$ is greater than 1. In that case, if the word frequency is calculated by the method shown in formula 2, Agglomeration(x) is $\frac{1}{n_1 n_2}$; accordingly, if the word frequency is calculated by the method shown in formula 3, Agglomeration(x) is $\frac{N}{n_1 n_2}$. In this case, participle $x_1$ appears together with participle $x_2$ only once in the target word segmentation sequence set, and in all other cases participle $x_1$ or participle $x_2$ appears alone. This indicates that the probability that $x_1x_2$ is used in combination as a single word is low.
It can be understood that other methods may also be used to calculate the degree of solidification Agglomeration(x) of the binary concatenated word x based on the target word segmentation sequence set, as long as Agglomeration(x) is negatively correlated with the number of occurrences $n_1$ of participle $x_1$ and the number of occurrences $n_2$ of participle $x_2$ in the target word segmentation sequence set, and positively correlated with the number of occurrences $n$ of the binary concatenated word x in the target word segmentation sequence set. For example, Agglomeration(x) may be calculated by the following formula 4 or formula 5:
Figure RE-GDA0002737402510000101
Agglomeration(x)=P(x1)+P(x2)-P(x1x2) (formula 5)
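As an illustration of the degree of solidification, the sketch below assumes the ratio form Agglomeration(x) = P(x) / (P(x1) · P(x2)), which matches the proportionality described above (increasing in n, decreasing in n1 and n2). That form, the function name and the `relative` switch are assumptions made for this example.

```python
from collections import Counter


def agglomeration(bigram, sequences, relative=True):
    """Degree of solidification under the assumed ratio form P(x)/(P(x1)*P(x2))."""
    tokens = Counter(tok for seq in sequences for tok in seq)
    bigrams = Counter(pair for seq in sequences for pair in zip(seq, seq[1:]))
    n = bigrams[bigram]
    if n == 0:
        return 0.0  # the bigram never occurs, so it has no solidification
    n1, n2 = tokens[bigram[0]], tokens[bigram[1]]
    if relative:
        total = sum(tokens.values())  # N
        return (n / total) / ((n1 / total) * (n2 / total))
    return n / (n1 * n2)
```

When the two participles only ever occur together (n = n1 = n2), the count-based value is 1/n and the ratio-based value is N/n; when they mostly occur apart, both values shrink.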
The degree of freedom of the binary concatenated word represents the degree to which the binary concatenated word, taken as a whole, combines freely with other participles in the target word segmentation sequences. That is, if the words preceding and following the binary concatenated word are relatively fixed, the degree of freedom of the binary concatenated word as a whole is considered low, and the binary concatenated word may not be a new organization descriptor. Conversely, if the words preceding and following the binary concatenated word vary widely, the degree of freedom of the binary concatenated word is considered high, and the binary concatenated word may be a new organization descriptor that can combine freely with the surrounding words.
Here, continuing with the above notation X, x, $x_1$, $x_2$, $P(x)$, $P(x_1)$ and $P(x_2)$, in some optional implementations the degree of freedom Free(x) of the binary concatenated word x may be calculated based on the target word segmentation sequence set as follows:
First, a preceding adjacent word set $Pre_x$ corresponding to the binary concatenated word x is generated from the participles that are located immediately before the binary concatenated word x in the word segmentation sequences of the target word segmentation sequence set.
Second, the word frequency $P(y)$ of each word y in the preceding adjacent word set $Pre_x$ in the target word segmentation sequence set is counted.
Third, a following adjacent word set $Post_x$ corresponding to the binary concatenated word x is generated from the participles that are located immediately after the binary concatenated word x in the word segmentation sequences of the target word segmentation sequence set.
Fourth, the word frequency $P(z)$ of each word z in the following adjacent word set $Post_x$ in the target word segmentation sequence set is counted.
Fifth, the degree of freedom Free(x) of the binary concatenated word x is calculated according to the following formulas.
$$H(Pre_x) = -\sum_{y \in Pre_x} P(y)\,\log P(y) \quad \text{(formula 6)}$$
$$H(Post_x) = -\sum_{z \in Post_x} P(z)\,\log P(z) \quad \text{(formula 7)}$$
$$\mathrm{Free}(x) = \min\bigl(H(Pre_x),\ H(Post_x)\bigr) \quad \text{(formula 8)}$$
As can be seen from the above description and from formulas 6, 7 and 8, $H(Pre_x)$ reflects the degree of variation of the preceding adjacent word set $Pre_x$ corresponding to the binary concatenated word x, and can also be understood as the degree of freedom of the participles before the binary concatenated word x. $H(Post_x)$ reflects the degree of variation of the following adjacent word set $Post_x$ corresponding to the binary concatenated word x, and can also be understood as the degree of freedom of the participles after the binary concatenated word x. The degree of freedom Free(x) of the binary concatenated word x is the smaller of $H(Pre_x)$ and $H(Post_x)$; that is, Free(x) reflects the smaller of the degree of variation of the preceding adjacent word set and the degree of variation of the following adjacent word set. The larger Free(x) is, the more the words before and after the binary concatenated word x vary, that is, the more freely the binary concatenated word x combines with other words, and the higher the probability that the binary concatenated word x is a new organization descriptor.
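The five steps above can be sketched as follows. Two assumptions are made and should be read as such: H is taken to be an entropy-style sum, -Σ P(w)·log P(w), over the adjacent word set, and P(w) is taken literally as the relative frequency of w in the whole target word segmentation sequence set. A common alternative, not used here, is to take P(w) relative to the neighbours of x only.

```python
import math
from collections import Counter


def freedom(bigram, sequences):
    """Free(x) = min(H(Pre_x), H(Post_x)) under the assumptions stated above."""
    token_counts = Counter(tok for seq in sequences for tok in seq)
    total = sum(token_counts.values())  # N
    pre, post = set(), set()
    for seq in sequences:
        for i in range(len(seq) - 1):
            if (seq[i], seq[i + 1]) == bigram:
                if i > 0:
                    pre.add(seq[i - 1])    # participle right before x
                if i + 2 < len(seq):
                    post.add(seq[i + 2])   # participle right after x

    def h(words):
        # Entropy-style sum over the adjacent word set; 0 for an empty set.
        return -sum(
            (token_counts[w] / total) * math.log(token_counts[w] / total)
            for w in words
        )

    return min(h(pre), h(post))
```

A bigram that always appears at the start or end of a sequence has an empty adjacent word set on that side, so its degree of freedom is 0 in this sketch.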
Substep 2042, in response to determining that the binary concatenated word satisfies each condition in the preset set of new word discovery conditions, determining the binary concatenated word as a new organizational descriptor.
Here, the execution subject may determine whether the binary spliced word satisfies each condition in the preset new word discovery condition group. If so, the binary spliced word may be determined as a new organization descriptor. The preset new word discovery condition group may include at least one of the following conditions: the word frequency of the binary spliced word is larger than a preset word frequency threshold, the degree of solidification of the binary spliced word is larger than a preset solidification degree threshold, and the degree of freedom of the binary spliced word is larger than a preset degree of freedom threshold.
Continuing with the notation of sub-step 2041, and letting $T_p$, $T_a$ and $T_f$ be the preset word frequency threshold, the preset solidification degree threshold and the preset degree of freedom threshold respectively, the preset new word discovery condition group may include at least one of the following conditions:
Condition 1: $P(x) > T_p$
Condition 2: $\mathrm{Agglomeration}(x) > T_a$
Condition 3: $\mathrm{Free}(x) > T_f$
In practice, the preset word frequency threshold, the preset solidification degree threshold and the preset degree of freedom threshold may be set manually by a technician according to experience and stored in the execution subject.
As can be seen from the description of sub-step 2041, if the binary spliced word x satisfies each condition in the preset new word discovery condition group, the probability that x is a new organization descriptor is high, and the binary spliced word may be determined as a new organization descriptor.
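The check in sub-step 2042 can be sketched as below. Because the condition group may contain any non-empty subset of the three conditions, each threshold is optional here; the function name, the parameter names and the use of `None` for "condition not in the group" are assumptions made for this illustration.

```python
def is_new_descriptor(freq, agglomeration, freedom, t_p=None, t_a=None, t_f=None):
    """Return True if every configured condition of the group is satisfied."""
    checks = []
    if t_p is not None:
        checks.append(freq > t_p)            # condition 1: P(x) > T_p
    if t_a is not None:
        checks.append(agglomeration > t_a)   # condition 2: Agglomeration(x) > T_a
    if t_f is not None:
        checks.append(freedom > t_f)         # condition 3: Free(x) > T_f
    if not checks:
        raise ValueError("the condition group must contain at least one condition")
    return all(checks)
```

The comparisons are strict, matching the "larger than" wording of the three conditions.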
In some optional implementations, the preset organization discovery duration described in step 201 may be predetermined by the duration determination step shown in fig. 3. Referring to fig. 3, fig. 3 shows a flow 300 of one embodiment of the duration determination step according to the present disclosure. The duration determination step comprises the following steps:
Here, the execution subject of the duration determination step may be the same as the execution subject of the above-described new organization descriptor recognition method. In this case, after determining the preset organization discovery duration, the execution subject of the duration determination step may store it locally and read it when executing the new organization descriptor recognition method.
Here, the execution subject of the duration determination step may also be different from the execution subject of the above-described new organization descriptor recognition method. In this case, after determining the preset organization discovery duration, the execution subject of the duration determination step may send it to the execution subject of the new organization descriptor recognition method, which may then read the received preset organization discovery duration when executing the new organization descriptor recognition method.
Step 301, for each candidate duration in the preset candidate duration set, performing an identification accuracy determination operation.
Here, the preset candidate duration set may be a set consisting of at least one candidate duration. The time units of the candidate durations may be the same or different. For example, the time unit of the candidate duration may be day, hour, or both day and hour. As an example, the preset candidate duration set may be {1 day, 2 days, 3 days, 4 days, 5 days, 6 days, 7 days }.
Here, the execution subject of the duration determination step may execute the recognition accuracy determination operation for each candidate duration in the preset candidate duration set, and specifically, the recognition accuracy determination operation may include sub-steps 3011 to 3015:
Sub-step 3011, obtaining the historical text set related to describing organizations that was generated within the most recent candidate duration, and the corresponding annotated new organization descriptor set.
In practice, the annotated new organization descriptor set, which consists of descriptors for new organizations that have not occurred historically, may be manually marked out from the historical text set related to describing organizations generated within the candidate duration.
Here, assuming that the candidate duration is the 3 days in the preset candidate duration set of the above example, sub-step 3011 obtains the historical text set related to describing organizations generated in the last 3 days and the corresponding annotated new organization descriptor set.
Sub-step 3012, performing word segmentation processing on each historical text in the acquired historical text set to obtain a corresponding word segmentation sequence, and generating a word segmentation sequence set corresponding to the candidate duration by using each word segmentation sequence obtained after the word segmentation processing.
Here, how to cut words of the text to obtain the word segmentation sequence may refer to the related description in step 202, and is not described herein again.
Sub-step 3013, generating a binary spliced word library corresponding to the candidate duration by using the binary spliced words formed by two adjacent participles in the word segmentation sequence set corresponding to the candidate duration.
Sub-step 3014, for each binary spliced word in the binary spliced word library corresponding to the candidate duration, calculating the word frequency, the degree of freedom and the degree of solidification of the binary spliced word based on the word segmentation sequence set corresponding to the candidate duration, and determining the binary spliced word as a recognized correct word in response to determining that the binary spliced word satisfies each condition in the preset new word discovery condition group and belongs to the annotated new organization descriptor set, or in response to determining that the binary spliced word does not satisfy at least one condition in the preset new word discovery condition group and does not belong to the annotated new organization descriptor set.
Here, for each binary spliced word in the binary spliced word library corresponding to the candidate duration generated in sub-step 3013, the execution subject of the duration determination step may determine the binary spliced word as a recognized correct word in either of two cases. In the first case, the binary spliced word is a new organization descriptor according to the preset new word discovery condition group, and it is also a new organization descriptor according to the annotated new organization descriptor set obtained in sub-step 3011; the binary spliced word is then considered to be recognized correctly by the preset new word discovery condition group. In the second case, the binary spliced word is not a new organization descriptor according to the preset new word discovery condition group, and it is also not a new organization descriptor according to the annotated new organization descriptor set obtained in sub-step 3011; the binary spliced word is then likewise considered to be recognized correctly.
Conversely, if the binary spliced word is a new organization descriptor according to the preset new word discovery condition group but is not a new organization descriptor according to the annotated new organization descriptor set obtained in sub-step 3011, the binary spliced word is considered to be recognized incorrectly and may be determined as a recognized incorrect word. Similarly, if the binary spliced word is not a new organization descriptor according to the preset new word discovery condition group but is a new organization descriptor according to the annotated new organization descriptor set obtained in sub-step 3011, the binary spliced word is also considered to be recognized incorrectly and may be determined as a recognized incorrect word.
Sub-step 3015, determining the ratio of the number of correct recognized words in the binary concatenated word library corresponding to the candidate duration to the number of binary concatenated words in the binary concatenated word library corresponding to the candidate duration as the recognition accuracy corresponding to the candidate duration.
Since it has been determined in sub-step 3014 whether each binary-spliced word in the binary-spliced word bank corresponding to the candidate duration is an identified correct word, a ratio of the number of identified correct words in the binary-spliced word bank corresponding to the candidate duration divided by the number of binary-spliced words in the binary-spliced word bank corresponding to the candidate duration may be determined as the identification accuracy corresponding to the candidate duration in sub-step 3015.
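Sub-steps 3014 and 3015 amount to comparing, for every binary spliced word, the decision of the condition group with the annotation. A minimal sketch, in which the function name and the set-based representation of the library, the predicted descriptors and the annotated descriptors are illustrative assumptions:

```python
def recognition_accuracy(library, predicted_new, annotated_new):
    """Share of binary spliced words whose prediction agrees with the annotation."""
    if not library:
        return 0.0
    correct = sum(
        1
        for word in library
        # Correct when both say "new" or both say "not new".
        if (word in predicted_new) == (word in annotated_new)
    )
    return correct / len(library)
```

Here `predicted_new` stands for the words that satisfy every condition in the preset new word discovery condition group, and `annotated_new` for the annotated new organization descriptor set.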
Step 302, determining the candidate duration with the highest corresponding recognition accuracy in the preset candidate duration set as the preset organization discovery duration.
After step 301, the recognition accuracy corresponding to each candidate duration in the preset candidate duration set has been determined, so the candidate duration with the highest recognition accuracy in the preset candidate duration set may be determined as the preset organization discovery duration.
When the new organization descriptor recognition method is executed, the preset organization discovery duration determined by the duration determination step shown in fig. 3 can be used to obtain the recent organization description related history text set, that is, the history text set related to describing organizations generated within that duration. Because this duration is the candidate duration with the highest recognition accuracy in the preset candidate duration set, there is no need to obtain history texts generated over a longer time in order to improve recognition accuracy. This reduces the amount of computation, so that both computational efficiency and recognition effect can be taken into account.
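The selection in step 302 can be sketched as a simple arg-max over the candidate durations. The function name and the `accuracy_of` callback are hypothetical; the callback stands for the recognition accuracy determination operation of step 301.

```python
def choose_discovery_duration(candidate_durations, accuracy_of):
    """Return the candidate duration with the highest recognition accuracy."""
    best, best_accuracy = None, -1.0
    for duration in candidate_durations:
        accuracy = accuracy_of(duration)  # step 301 for this candidate
        if accuracy > best_accuracy:      # strict: the first maximum wins ties
            best, best_accuracy = duration, accuracy
    return best
```

For example, with made-up accuracies `{1: 0.62, 3: 0.81, 7: 0.78}` (durations in days), calling `choose_discovery_duration([1, 3, 7], scores.get)` selects 3.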
The method provided by the above embodiment of the present disclosure first obtains a recent organization description related history text set, that is, a history text set related to describing organizations that was generated within the most recent preset organization discovery duration. Word segmentation is then performed on the texts in that set to obtain a target word segmentation sequence set, and a binary concatenated word library corresponding to the recent organization description related history text set is generated. Finally, for each binary concatenated word in the binary concatenated word library, the word frequency, the degree of freedom and the degree of solidification of the binary concatenated word are calculated based on the target word segmentation sequence set, and if the binary concatenated word is determined to satisfy each condition in the preset new word discovery condition group, the binary concatenated word is determined as a new organization descriptor. With this method, the whole process of identifying new organization descriptors requires no manual operation, which reduces the labor cost and time cost of discovering new organization descriptors.
With further reference to fig. 4, a flow 400 of yet another embodiment of a new organizational descriptor recognition method is shown. The process 400 of the new organization descriptor recognition method includes the following steps:
Step 401, obtaining a recent organization description related historical text set.
In this embodiment, the specific operation and the technical effect of step 401 are substantially the same as those of step 201 in the embodiment shown in fig. 2, and are not repeated herein.
Step 402, performing word segmentation processing on each recent organization description related historical text in the recent organization description related historical text set based on a preset word segmentation dictionary to obtain a corresponding word segmentation sequence, and generating a target word segmentation sequence set by using each word segmentation sequence obtained after the word segmentation processing.
In this embodiment, the executing body of the new organization descriptor recognition method may adopt a dictionary-based word segmentation method, perform word segmentation processing on each recent organization description related history text in the recent organization description related history text set acquired in step 401 based on a preset word segmentation dictionary to obtain a corresponding word segmentation sequence, and generate a target word segmentation sequence set by using each word segmentation sequence obtained after the word segmentation processing.
In practice, the dictionary-based word segmentation method may include a forward maximum matching method, a reverse maximum matching method, and a bidirectional matching word segmentation method according to different scanning directions. The dictionary-based word segmentation method may refer to matching a word string to be analyzed (e.g., each recent organization description related history text in the recent organization description related history text set in step 402) with entries in a preset word segmentation dictionary according to a certain policy, segmenting the word string into words if the word string exists in the dictionary, and then performing matching of a next word string.
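Of the dictionary-based methods mentioned above, forward maximum matching is the simplest to illustrate. The sketch below is one common way to write it, with a single-character fallback for text not covered by the dictionary; the function name and the fallback behaviour are assumptions rather than details given in the text.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Segment text left to right, always taking the longest dictionary entry."""
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary hit;
        # a single character is always accepted so the scan cannot stall.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens
```

Reverse maximum matching scans from the end of the string instead, and bidirectional matching runs both and picks one result by a heuristic such as fewer tokens.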
Step 403, generating a binary concatenated word library by using the binary concatenated words formed by two adjacent participles in the target word segmentation sequences of the target word segmentation sequence set.
Step 404, executing an identification operation on each binary concatenated word in the binary concatenated word library.
In this embodiment, the specific operations of step 403 and step 404 and the technical effects thereof are substantially the same as the operations and effects of step 203 and step 204 in the embodiment shown in fig. 2, and are not repeated herein.
Step 405, adding each binary concatenated word determined as a new organization descriptor in the binary concatenated word library to the preset word segmentation dictionary.
In this embodiment, the execution subject may add each binary concatenated word determined as the new organization descriptor in step 404 in the binary concatenated word library generated in step 403 to the preset word segmentation dictionary. Therefore, when the new organization descriptor recognition method is executed again next time, the new organization descriptor recognized this time is already added into the preset word segmentation dictionary, namely the preset word segmentation dictionary is updated, and the new organization descriptor recognized this time will not be recognized as a new organization descriptor next time.
It should be noted that the preset word segmentation dictionary may be obtained by gradually adding new organization descriptors on the basis of the general word segmentation dictionary.
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the flow 400 of the new organization descriptor recognition method in this embodiment has an additional step of updating the preset word segmentation dictionary. Therefore, the scheme described in this embodiment can update the preset word segmentation dictionary in real time. Because a word identified as a new organization descriptor in the current run has already been added to the preset word segmentation dictionary, it will be segmented as a single word in later runs and will not be identified as a new organization descriptor again.
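The dictionary update of step 405 can be sketched as follows, assuming the preset word segmentation dictionary is held as a set of strings and each new organization descriptor is the concatenation of its two participles. The function name is illustrative.

```python
def update_dictionary(dictionary, new_descriptors):
    """Add newly recognized descriptors to the dictionary; return what was added."""
    added = {word for word in new_descriptors if word not in dictionary}
    dictionary.update(added)  # mutates the caller's dictionary in place
    return added
```

On the next run, a dictionary-based segmenter will emit each added descriptor as one participle, so it no longer appears as a pair of adjacent participles in the binary concatenated word library.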
With further reference to fig. 5, as an implementation of the method shown in the above figures, the present disclosure provides an embodiment of a new organization descriptor recognition apparatus, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 5, the new organization descriptor recognition apparatus 500 of the present embodiment includes: an acquisition unit 501, a first generation unit 502, a second generation unit 503, and a recognition unit 504. The acquisition unit 501 is configured to acquire a recent organization description related history text set, where the recent organization description related history text set is a set of history texts that were generated within the most recent preset organization discovery duration and are related to describing organizations; the first generation unit 502 is configured to perform word segmentation processing on each recent organization description related history text in the recent organization description related history text set to obtain a corresponding word segmentation sequence, and to generate a target word segmentation sequence set by using each word segmentation sequence obtained after the word segmentation processing; the second generation unit 503 is configured to generate a binary spliced word library by using binary spliced words each composed of two adjacent segmented words in a target word segmentation sequence in the target word segmentation sequence set; the recognition unit 504 is configured to perform the following recognition operation for each binary spliced word in the binary spliced word library: calculating the word frequency, the degree of freedom and the degree of solidification of the binary spliced word based on the target word segmentation sequence set, and determining the binary spliced word as a new organization descriptor in response to determining that the binary spliced word meets each condition in a preset new word discovery condition set, wherein the preset new word discovery condition set comprises at least one of the following conditions: the word frequency of the binary spliced word is larger than a preset word frequency threshold value, the degree of solidification of the binary spliced word is larger than a preset degree of solidification threshold value, and the degree of freedom of the binary spliced word is larger than a preset degree of freedom threshold value.
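As an illustration only, the work of the second generation unit and the threshold check of the recognition unit might be sketched in Python as follows. The function names, the dictionary keys and the use of raw counts are assumptions made for this sketch and are not taken from the disclosure; the sketch also assumes that all three conditions are included in the preset condition set.

```python
from collections import Counter


def build_bigram_library(token_sequences):
    """Count every pair of adjacent segmented words across all sequences.

    Each key is a (left, right) tuple standing for one binary spliced word.
    """
    counts = Counter()
    for seq in token_sequences:
        for left, right in zip(seq, seq[1:]):
            counts[(left, right)] += 1
    return counts


def meets_all_conditions(freq, solidification, freedom, thresholds):
    """Return True only if every value is strictly larger than its threshold.

    `thresholds` is a hypothetical dict with the keys "freq",
    "solidification" and "freedom".
    """
    return (
        freq > thresholds["freq"]
        and solidification > thresholds["solidification"]
        and freedom > thresholds["freedom"]
    )
```

A binary spliced word would be reported as a new organization descriptor only when `meets_all_conditions` returns True for its three computed values.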
In this embodiment, for the specific processing of the acquisition unit 501, the first generation unit 502, the second generation unit 503, and the recognition unit 504 of the new organization descriptor recognition apparatus 500 and the technical effects thereof, reference may be made to the descriptions of step 201, step 202, step 203, and step 204 in the embodiment corresponding to fig. 2, which are not repeated here.
In some optional embodiments, the performing word segmentation processing on each recent organization description related history text in the recent organization description related history text set to obtain a corresponding word segmentation sequence may include: performing word segmentation processing on each recent organization description related history text in the recent organization description related history text set based on a preset word segmentation dictionary to obtain a corresponding word segmentation sequence; and the apparatus 500 may further include: an adding unit 505 configured to add each binary spliced word determined as a new organization descriptor in the binary spliced word library to the preset word segmentation dictionary.
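The effect of adding newly found descriptors to the word segmentation dictionary can be illustrated with a small Python sketch. The disclosure does not say which segmentation algorithm is used; the greedy forward maximum matching segmenter below, the `max_len` parameter and both function names are assumptions chosen only to make the example self-contained.

```python
def segment(text, dictionary, max_len=6):
    """Greedy forward maximum matching (an assumed, simplified segmenter).

    At each position, take the longest dictionary entry that matches;
    fall back to a single character when nothing matches.
    """
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def add_new_descriptors(dictionary, new_descriptors):
    """Add each (left, right) binary spliced word to the dictionary as one entry."""
    for left, right in new_descriptors:
        dictionary.add(left + right)
    return dictionary
```

Once a binary spliced word has been added, later segmentation runs emit it as a single segmented word, so it can in turn become one half of a longer binary spliced word.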
In some optional embodiments, the preset organization discovery duration may be predetermined by the following duration determination steps: for each candidate duration in a preset candidate duration set, performing the following recognition accuracy determination operations: acquiring a set of historical texts that were generated within the candidate duration and are related to describing organizations, together with a corresponding labeled new organization descriptor set; performing word segmentation processing on each historical text in the acquired historical text set to obtain a corresponding word segmentation sequence, and generating a word segmentation sequence set corresponding to the candidate duration by using each word segmentation sequence obtained after the word segmentation processing; generating a binary spliced word library corresponding to the candidate duration by using binary spliced words each formed by two adjacent segmented words in the word segmentation sequence set corresponding to the candidate duration; for each binary spliced word in the binary spliced word library corresponding to the candidate duration, calculating the word frequency, the degree of freedom and the degree of solidification of the binary spliced word based on the word segmentation sequence set corresponding to the candidate duration, and determining the binary spliced word as a correctly recognized word in response to determining that the binary spliced word satisfies each condition in the preset new word discovery condition set and belongs to the labeled new organization descriptor set, or in response to determining that the binary spliced word fails to satisfy at least one condition in the preset new word discovery condition set and does not belong to the labeled new organization descriptor set; determining the ratio of the number of correctly recognized words in the binary spliced word library corresponding to the candidate duration to the number of binary spliced words in that library as the recognition accuracy corresponding to the candidate duration; and determining the candidate duration with the highest recognition accuracy in the preset candidate duration set as the preset organization discovery duration.
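The duration determination steps amount to an accuracy-based selection over the candidate durations, which might be sketched in Python as follows. The function names and the `evaluate` callable (which stands for running the whole pipeline on the texts of one candidate duration and returning its recognition accuracy) are hypothetical and introduced only for this sketch.

```python
def recognition_accuracy(bigram_library, predicted_new, labeled_new):
    """Share of binary spliced words whose prediction agrees with the label.

    A word counts as correctly recognized when it is either predicted and
    labeled as new, or neither predicted nor labeled as new.
    """
    if not bigram_library:
        return 0.0
    correct = sum(
        1 for word in bigram_library
        if (word in predicted_new) == (word in labeled_new)
    )
    return correct / len(bigram_library)


def choose_discovery_duration(candidate_durations, evaluate):
    """Return the candidate duration with the highest recognition accuracy."""
    return max(candidate_durations, key=evaluate)
```

For example, with candidate durations of 7, 30 and 90 days and accuracies of 0.6, 0.8 and 0.7, the sketch selects 30 days. When two candidates tie, `max` returns the first one encountered; the disclosure does not state a tie-breaking rule.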
In some optional embodiments, the calculating, for each binary spliced word in the binary spliced word library, the word frequency, the degree of freedom, and the degree of solidification of the binary spliced word based on the target word segmentation sequence set may include: for each binary spliced word x in the binary spliced word library, formed by splicing a segmented word x_1 and a segmented word x_2, performing the following calculation operations: counting the word frequency P(x) of the binary spliced word x in the target word segmentation sequence set, the word frequency P(x_1) of the segmented word x_1 in the target word segmentation sequence set, and the word frequency P(x_2) of the segmented word x_2 in the target word segmentation sequence set; calculating the degree of solidification Agglomeration(x) of the binary spliced word x according to the following formula:
[Formula image: definition of Agglomeration(x) in terms of P(x), P(x_1) and P(x_2)]
generating a preceding adjacent word set Pre_x corresponding to the binary spliced word x by using each segmented word that is located before and adjacent to the binary spliced word x in each word segmentation sequence of the target word segmentation sequence set; counting the word frequency P(y), in the target word segmentation sequence set, of each word y in the preceding adjacent word set Pre_x; generating a following adjacent word set Post_x corresponding to the binary spliced word x by using each segmented word that is located after and adjacent to the binary spliced word x in each word segmentation sequence of the target word segmentation sequence set; counting the word frequency P(z), in the target word segmentation sequence set, of each word z in the following adjacent word set Post_x; and calculating the degree of freedom Free(x) of the binary spliced word x according to the following formulas:
[Formula images: definitions of H(Pre_x) and H(Post_x) in terms of P(y) and P(z), respectively]
Free(x) = min(H(Pre_x), H(Post_x))
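The calculation operations can be sketched in Python as below. This is only a sketch under stated assumptions: the formula images are not reproduced in this text, so the sketch assumes the forms commonly used in new word discovery, namely Agglomeration(x) = P(x) / (P(x_1) P(x_2)) and H(S) = -Σ P(w) log P(w) over the words w of an adjacent word set S, with P taken as a relative frequency over all segmented words. These forms, and the function name `bigram_metrics`, are assumptions rather than the formulas of the disclosure.

```python
import math
from collections import Counter


def bigram_metrics(token_sequences, x1, x2):
    """Return (P(x), Agglomeration(x), Free(x)) for the bigram x = x1 + x2.

    ASSUMED formulas (the source formula images are unavailable):
      Agglomeration(x) = P(x) / (P(x1) * P(x2))
      H(S) = -sum(P(w) * log(P(w)) for w in S)
      Free(x) = min(H(Pre_x), H(Post_x))
    """
    unigram = Counter(t for seq in token_sequences for t in seq)
    total = sum(unigram.values())

    bigram_count = 0
    pre, post = set(), set()  # Pre_x and Post_x
    for seq in token_sequences:
        for i in range(len(seq) - 1):
            if seq[i] == x1 and seq[i + 1] == x2:
                bigram_count += 1
                if i > 0:
                    pre.add(seq[i - 1])
                if i + 2 < len(seq):
                    post.add(seq[i + 2])

    def p(word):
        # Relative frequency of a segmented word in the whole set.
        return unigram[word] / total

    def entropy(words):
        return -sum(p(w) * math.log(p(w)) for w in words)

    if bigram_count == 0:
        return 0.0, 0.0, 0.0

    p_x = bigram_count / total
    agglomeration = p_x / (p(x1) * p(x2))
    freedom = min(entropy(pre), entropy(post))
    return p_x, agglomeration, freedom
```

Taking the minimum of the two entropies means a binary spliced word scores as free only if it appears in varied contexts on both its left and its right side.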
It should be noted that, for details of the implementation and technical effects of each unit in the new organization descriptor recognition apparatus provided in the present disclosure, reference may be made to the descriptions of other embodiments in the present disclosure, which are not repeated here.
Referring now to FIG. 6, a block diagram of a computer system 600 suitable for use in implementing the electronic device of the present disclosure is shown. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the present disclosure.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU) 601, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An Input/Output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input section 606 including a touch screen, a tablet, a keyboard, a mouse, or the like; an output section 607 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), a speaker, and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 609 performs communication processing via a network such as the Internet.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from the network through the communication section 609. The above-described functions defined in the method of the present disclosure are performed when the computer program is executed by a Central Processing Unit (CPU) 601. It should be noted that the computer readable medium in the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. 
In contrast, in the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk, C++, and Python, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in this disclosure may be implemented by software or hardware. The described units may also be provided in a processor, which may be described as: a processor including an acquisition unit, a first generation unit, a second generation unit, and a recognition unit. The names of these units do not, in some cases, constitute a limitation on the units themselves; for example, the acquisition unit may also be described as a "unit that acquires a recent organization description related history text set".
As another aspect, the present disclosure also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: acquire a recent organization description related history text set, wherein the recent organization description related history text set is a set of history texts that were generated within the most recent preset organization discovery duration and are related to describing organizations; perform word segmentation processing on each recent organization description related history text in the recent organization description related history text set to obtain a corresponding word segmentation sequence, and generate a target word segmentation sequence set by using each word segmentation sequence obtained after the word segmentation processing; generate a binary spliced word library by using binary spliced words each formed by two adjacent segmented words in a target word segmentation sequence in the target word segmentation sequence set; and, for each binary spliced word in the binary spliced word library, execute the following recognition operation: calculating the word frequency, the degree of freedom and the degree of solidification of the binary spliced word based on the target word segmentation sequence set, and determining the binary spliced word as a new organization descriptor in response to determining that the binary spliced word meets each condition in a preset new word discovery condition set, wherein the preset new word discovery condition set comprises at least one of the following conditions: the word frequency of the binary spliced word is larger than a preset word frequency threshold value, the degree of solidification of the binary spliced word is larger than a preset degree of solidification threshold value, and the degree of freedom of the binary spliced word is larger than a preset degree of freedom threshold value.
The foregoing description covers only the preferred embodiments of the present disclosure and the technical principles employed. Those skilled in the art will appreciate that the scope of the invention in the present disclosure is not limited to technical solutions formed by the specific combination of the above features, and also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in the present disclosure.

Claims (10)

1. A new organization descriptor recognition method, comprising:
acquiring a recent organization description related historical text set, wherein the recent organization description related historical text set is a historical alarm receiving and handling text set which is generated within a latest preset organization discovery duration and is related to a description organization;
performing word segmentation processing on each recent organization description related historical text in the recent organization description related historical text set to obtain a corresponding word segmentation sequence, and generating a target word segmentation sequence set by using each word segmentation sequence obtained after the word segmentation processing;
generating a binary concatenation word library by using a binary concatenation word formed by two adjacent word segmentations in a target word segmentation sequence in the target word segmentation sequence set;
for each binary concatenated word in the binary concatenated word library, performing the following identification operations: calculating the word frequency, the degree of freedom and the degree of solidity of the binary spliced word based on the target word segmentation sequence set, and determining the binary spliced word as a new organization descriptor in response to determining that the binary spliced word meets each condition in a preset new word discovery condition set, wherein the preset new word discovery condition set comprises at least one of the following conditions: the word frequency of the binary spliced word is larger than a preset word frequency threshold value, the degree of solidification of the binary spliced word is larger than a preset degree of solidification threshold value, and the degree of freedom of the binary spliced word is larger than a preset degree of freedom threshold value.
2. The method of claim 1, wherein the performing word segmentation processing on each recent organization description related history text in the recent organization description related history text set to obtain a corresponding word segmentation sequence comprises:
performing word segmentation processing on each recent organization description related historical text in the recent organization description related historical text set based on a preset word segmentation dictionary to obtain a corresponding word segmentation sequence; and
the method further comprises the following steps:
and adding each binary concatenation word determined as a new organization descriptor in the binary concatenation word library into the preset word segmentation dictionary.
3. The method according to claim 1 or 2, wherein the preset organization discovery duration is predetermined by the following duration determination steps:
for each candidate duration in a preset candidate duration set, performing the following recognition accuracy determination operations: acquiring a set of historical texts that were generated within the candidate duration and are related to describing organizations, together with a corresponding labeled new organization descriptor set; performing word segmentation processing on each historical text in the acquired historical text set to obtain a corresponding word segmentation sequence, and generating a word segmentation sequence set corresponding to the candidate duration by using each word segmentation sequence obtained after the word segmentation processing; generating a binary spliced word library corresponding to the candidate duration by using binary spliced words each formed by two adjacent segmented words in the word segmentation sequence set corresponding to the candidate duration; for each binary spliced word in the binary spliced word library corresponding to the candidate duration, calculating the word frequency, the degree of freedom and the degree of solidification of the binary spliced word based on the word segmentation sequence set corresponding to the candidate duration, and determining the binary spliced word as a correctly recognized word in response to determining that the binary spliced word satisfies each condition in the preset new word discovery condition set and belongs to the labeled new organization descriptor set, or in response to determining that the binary spliced word fails to satisfy at least one condition in the preset new word discovery condition set and does not belong to the labeled new organization descriptor set; determining the ratio of the number of correctly recognized words in the binary spliced word library corresponding to the candidate duration to the number of binary spliced words in that library as the recognition accuracy corresponding to the candidate duration;
and determining the candidate duration with the highest recognition accuracy in the preset candidate duration set as the preset organization discovery duration.
4. The method of claim 3, wherein the calculating, for each binary spliced word in the binary spliced word library, the word frequency, the degree of freedom and the degree of solidification of the binary spliced word based on the target word segmentation sequence set comprises:
for each binary spliced word x in the binary spliced word library, formed by splicing a segmented word x_1 and a segmented word x_2, performing the following calculation operations:
counting the word frequency P(x) of the binary spliced word x in the target word segmentation sequence set, the word frequency P(x_1) of the segmented word x_1 in the target word segmentation sequence set, and the word frequency P(x_2) of the segmented word x_2 in the target word segmentation sequence set;
calculating the degree of solidification Agglomeration(x) of the binary spliced word x according to the following formula:
[Formula image: definition of Agglomeration(x) in terms of P(x), P(x_1) and P(x_2)]
generating a preceding adjacent word set Pre_x corresponding to the binary spliced word x by using each segmented word that is located before and adjacent to the binary spliced word x in each word segmentation sequence of the target word segmentation sequence set;
counting the word frequency P(y), in the target word segmentation sequence set, of each word y in the preceding adjacent word set Pre_x;
generating a following adjacent word set Post_x corresponding to the binary spliced word x by using each segmented word that is located after and adjacent to the binary spliced word x in each word segmentation sequence of the target word segmentation sequence set;
counting the word frequency P(z), in the target word segmentation sequence set, of each word z in the following adjacent word set Post_x;
calculating the degree of freedom Free(x) of the binary spliced word x according to the following formulas:
[Formula images: definitions of H(Pre_x) and H(Post_x) in terms of P(y) and P(z), respectively]
Free(x) = min(H(Pre_x), H(Post_x)).
5. A new organization descriptor recognition apparatus, comprising:
the obtaining unit is configured to obtain a recent organization description related historical text set, wherein the recent organization description related historical text set is a historical text set which is generated within a recent preset organization discovery duration and is related to a description organization;
the first generation unit is configured to perform word segmentation processing on each recent organization description related historical text in the recent organization description related historical text set to obtain a corresponding word segmentation sequence, and generate a target word segmentation sequence set by using each word segmentation sequence obtained after the word segmentation processing;
a second generation unit configured to generate a binary concatenated word library using a binary concatenated word composed of two adjacent participles in a target participle sequence in the target participle sequence set;
the recognition unit is configured to execute the following recognition operation for each binary concatenation word in the binary concatenation word library: calculating the word frequency, the degree of freedom and the degree of solidity of the binary spliced word based on the target word segmentation sequence set, and determining the binary spliced word as a new organization descriptor in response to determining that the binary spliced word meets each condition in a preset new word discovery condition set, wherein the preset new word discovery condition set comprises at least one of the following conditions: the word frequency of the binary spliced word is larger than a preset word frequency threshold value, the degree of solidification of the binary spliced word is larger than a preset degree of solidification threshold value, and the degree of freedom of the binary spliced word is larger than a preset degree of freedom threshold value.
6. The apparatus of claim 5, wherein the performing word segmentation processing on each recent organization description related history text in the recent organization description related history text set to obtain a corresponding word segmentation sequence comprises:
performing word segmentation processing on each recent organization description related historical text in the recent organization description related historical text set based on a preset word segmentation dictionary to obtain a corresponding word segmentation sequence; and
the device further comprises:
and the adding unit is configured to add each binary spliced word determined as the new organization descriptor in the binary spliced word library into the preset word segmentation dictionary.
7. The apparatus according to claim 5 or 6, wherein the preset organization discovery duration is predetermined by the following duration determination steps:
for each candidate duration in a preset candidate duration set, performing the following recognition accuracy determination operations: acquiring a set of historical texts that were generated within the candidate duration and are related to describing organizations, together with a corresponding labeled new organization descriptor set; performing word segmentation processing on each historical text in the acquired historical text set to obtain a corresponding word segmentation sequence, and generating a word segmentation sequence set corresponding to the candidate duration by using each word segmentation sequence obtained after the word segmentation processing; generating a binary spliced word library corresponding to the candidate duration by using binary spliced words each formed by two adjacent segmented words in the word segmentation sequence set corresponding to the candidate duration; for each binary spliced word in the binary spliced word library corresponding to the candidate duration, calculating the word frequency, the degree of freedom and the degree of solidification of the binary spliced word based on the word segmentation sequence set corresponding to the candidate duration, and determining the binary spliced word as a correctly recognized word in response to determining that the binary spliced word satisfies each condition in the preset new word discovery condition set and belongs to the labeled new organization descriptor set, or in response to determining that the binary spliced word fails to satisfy at least one condition in the preset new word discovery condition set and does not belong to the labeled new organization descriptor set; determining the ratio of the number of correctly recognized words in the binary spliced word library corresponding to the candidate duration to the number of binary spliced words in that library as the recognition accuracy corresponding to the candidate duration;
and determining the candidate duration with the highest recognition accuracy in the preset candidate duration set as the preset organization discovery duration.
8. The apparatus of claim 7, wherein the calculating, for each binary spliced word in the binary spliced word library, the word frequency, the degree of freedom, and the degree of solidification of the binary spliced word based on the target word segmentation sequence set comprises:
for each binary spliced word x in the binary spliced word library, formed by splicing a segmented word x_1 and a segmented word x_2, performing the following calculation operations:
counting the word frequency P(x) of the binary spliced word x in the target word segmentation sequence set, the word frequency P(x_1) of the segmented word x_1 in the target word segmentation sequence set, and the word frequency P(x_2) of the segmented word x_2 in the target word segmentation sequence set;
calculating the degree of solidification Agglomeration(x) of the binary spliced word x according to the following formula:
[Formula image: definition of Agglomeration(x) in terms of P(x), P(x_1) and P(x_2)]
generating a preceding adjacent word set Pre_x corresponding to the binary spliced word x by using each segmented word that is located before and adjacent to the binary spliced word x in each word segmentation sequence of the target word segmentation sequence set;
counting the word frequency P(y), in the target word segmentation sequence set, of each word y in the preceding adjacent word set Pre_x;
generating a following adjacent word set Post_x corresponding to the binary spliced word x by using each segmented word that is located after and adjacent to the binary spliced word x in each word segmentation sequence of the target word segmentation sequence set;
counting the word frequency P(z), in the target word segmentation sequence set, of each word z in the following adjacent word set Post_x;
calculating the degree of freedom Free(x) of the binary spliced word x according to the following formulas:
[Formula images: definitions of H(Pre_x) and H(Post_x) in terms of P(y) and P(z), respectively]
Free(x) = min(H(Pre_x), H(Post_x)).
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method recited in any of claims 1-4.
10. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-4.
CN202010435003.3A 2020-05-21 2020-05-21 New organization descriptor recognition method and device, electronic equipment and storage medium Active CN112329458B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010435003.3A CN112329458B (en) 2020-05-21 2020-05-21 New organization descriptor recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010435003.3A CN112329458B (en) 2020-05-21 2020-05-21 New organization descriptor recognition method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112329458A true CN112329458A (en) 2021-02-05
CN112329458B CN112329458B (en) 2024-05-10

Family

ID=74302841

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010435003.3A Active CN112329458B (en) 2020-05-21 2020-05-21 New organization descriptor recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112329458B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103020251A (en) * 2012-12-20 2013-04-03 人民搜索网络股份公司 Automatic mining system and method of news events in large-scale data
US20150079554A1 (en) * 2012-05-17 2015-03-19 Postech Academy-Industry Foundation Language learning system and learning method
CN104951428A (en) * 2014-03-26 2015-09-30 阿里巴巴集团控股有限公司 User intention recognition method and device
CN105389349A (en) * 2015-10-27 2016-03-09 上海智臻智能网络科技股份有限公司 Dictionary updating method and apparatus
US20160098383A1 (en) * 2014-10-07 2016-04-07 International Business Machines Corporation Implicit Durations Calculation and Similarity Comparison in Question Answering Systems
CN106776542A (en) * 2016-11-23 2017-05-31 北京小米移动软件有限公司 The crucial word treatment method of field feedback, device and server
CN108038119A (en) * 2017-11-01 2018-05-15 平安科技(深圳)有限公司 Utilize the method, apparatus and storage medium of new word discovery investment target
CN108319582A (en) * 2017-12-29 2018-07-24 北京城市网邻信息技术有限公司 Processing method, device and the server of text message
CN109408818A (en) * 2018-10-12 2019-03-01 平安科技(深圳)有限公司 New word identification method, device, computer equipment and storage medium
CN109614499A (en) * 2018-11-22 2019-04-12 阿里巴巴集团控股有限公司 A kind of dictionary generating method, new word discovery method, apparatus and electronic equipment
CN110457595A (en) * 2019-08-01 2019-11-15 腾讯科技(深圳)有限公司 Emergency event alarm method, device, system, electronic equipment and storage medium
CN111147905A (en) * 2019-12-31 2020-05-12 深圳Tcl数字技术有限公司 Media resource searching method, television, storage medium and device
CN111159557A (en) * 2019-12-31 2020-05-15 北京奇艺世纪科技有限公司 Hotspot information acquisition method, device, server and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jie Fei (介飞): "Burst detection of implicit events in social networks", Acta Automatica Sinica (自动化学报), vol. 44, no. 04, 11 December 2017 (2017-12-11), pages 730 - 742 *

Also Published As

Publication number Publication date
CN112329458B (en) 2024-05-10

Similar Documents

Publication Publication Date Title
US10755048B2 (en) Artificial intelligence based method and apparatus for segmenting sentence
CN114612759B (en) Video processing method, video query method, model training method and model training device
CN111061877A (en) Text theme extraction method and device
CN111368551A (en) Method and device for determining event subject
CN111626054B (en) Novel illegal action descriptor recognition method and device, electronic equipment and storage medium
CN111078849A (en) Method and apparatus for outputting information
CN113590756A (en) Information sequence generation method and device, terminal equipment and computer readable medium
CN110675865B (en) Method and apparatus for training hybrid language recognition models
CN113111233A (en) Regular expression-based method and device for extracting residential address of alarm receiving and processing text
CN113111167A (en) Method and device for extracting vehicle model of alarm receiving and processing text based on deep learning model
CN112329458B (en) New organization descriptor recognition method and device, electronic equipment and storage medium
WO2022148239A1 (en) Method and apparatus for information output, and electronic device
CN108628909B (en) Information pushing method and device
CN113111230B (en) Regular expression-based alarm receiving text home address extraction method and device
CN115098729A (en) Video processing method, sample generation method, model training method and device
CN111666449B (en) Video retrieval method, apparatus, electronic device, and computer-readable medium
CN112131874A (en) New group descriptor recognition method and device, electronic device and storage medium
CN113111174A (en) Group identification method, device, equipment and medium based on deep learning model
CN114490400A (en) Method and device for processing test cases
CN114066603A (en) Post-loan risk early warning method and device, electronic equipment and computer readable medium
CN113239259A (en) Method and device for determining similar stores
CN109308299B (en) Method and apparatus for searching information
CN111626053B (en) New scheme means descriptor recognition method and device, electronic equipment and storage medium
CN113094499A (en) Deep learning model-based organization identification method and device, equipment and medium
CN112650830B (en) Keyword extraction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant