CN112329458A - New organization descriptor recognition method and device, electronic device and storage medium - Google Patents
- Publication number
- CN112329458A (application CN202010435003.3A)
- Authority
- CN
- China
- Prior art keywords
- word
- binary
- spliced
- preset
- organization
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/18—Legal services
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Business, Economics & Management (AREA)
- Tourism & Hospitality (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- Technology Law (AREA)
- Economics (AREA)
- Human Resources & Organizations (AREA)
- Marketing (AREA)
- Primary Health Care (AREA)
- Strategic Management (AREA)
- General Business, Economics & Management (AREA)
- Machine Translation (AREA)
Abstract
The disclosure provides a new organization descriptor recognition method and apparatus, an electronic device, and a storage medium. One embodiment of the method comprises: acquiring a recent organization description related historical text set; performing word segmentation processing on each recent organization description related historical text in the set to obtain a corresponding word segmentation sequence, and generating a target word segmentation sequence set from the word segmentation sequences so obtained; generating a binary concatenated word library from the binary concatenated words formed by two adjacent segmented words in the target word segmentation sequence set; and, for each binary concatenated word in the binary concatenated word library, performing a recognition operation to determine whether the binary concatenated word is a new organization descriptor. The embodiment enables automatic extraction of new organization descriptors from the recent organization description related historical text set.
Description
Technical Field
The disclosure relates to the technical field of computers, in particular to a new organization descriptor identification method and device, an electronic device and a storage medium.
Background
At present, new organization descriptors in recently generated texts are mostly extracted manually. This requires considerable manpower and time, so novel organizations and their activities or behaviors cannot be discovered and handled promptly, which poses hidden dangers to society. In addition, most texts are written in natural language and are highly colloquial and irregular in expression, which makes manual extraction difficult; and because manual extraction of new organization descriptors relies on personal experience, the learning cost is high.
Disclosure of Invention
The disclosure provides a new organization descriptor recognition method and device, an electronic device and a storage medium.
In a first aspect, the present disclosure provides a new organizational descriptor recognition method, including: acquiring a recent organization description related historical text set, wherein the recent organization description related historical text set is a historical text set which is generated within a latest preset organization discovery duration and is related to a description organization; performing word segmentation processing on each recent organization description related historical text in the recent organization description related historical text set to obtain a corresponding word segmentation sequence, and generating a target word segmentation sequence set by using each word segmentation sequence obtained after the word segmentation processing; generating a binary concatenation word library by using a binary concatenation word formed by two adjacent word segmentations in a target word segmentation sequence in the target word segmentation sequence set; for each binary concatenation word in the binary concatenation word library, executing the following identification operation: calculating the word frequency, the degree of freedom and the degree of solidity of the binary spliced word based on the target word segmentation sequence set, and determining the binary spliced word as a new organization descriptor in response to determining that the binary spliced word meets each condition in a preset new word discovery condition set, wherein the preset new word discovery condition set comprises at least one of the following conditions: the word frequency of the binary spliced word is larger than a preset word frequency threshold value, the degree of solidification of the binary spliced word is larger than a preset degree of solidification threshold value, and the degree of freedom of the binary spliced word is larger than a preset degree of freedom threshold value.
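As a rough illustration of the recognition decision described above, the following Python sketch checks one binary concatenated word against the condition set. It assumes the condition set contains all three conditions (the disclosure only requires at least one of them); the threshold values and the names `Thresholds` and `is_new_descriptor` are hypothetical and do not come from the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical values: the disclosure does not give concrete thresholds.
    min_frequency: float = 5.0
    min_solidification: float = 2.0
    min_freedom: float = 0.5


def is_new_descriptor(frequency, solidification, freedom, t=Thresholds()):
    """Return True only if every condition in the (assumed) condition set holds.

    Each comparison is strict ("larger than"), mirroring the wording of the
    conditions in the text.
    """
    return (
        frequency > t.min_frequency
        and solidification > t.min_solidification
        and freedom > t.min_freedom
    )
```

A word whose frequency merely equals the threshold is rejected under this sketch, because each condition is written as "larger than".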
In some optional embodiments, the performing word segmentation processing on each recent organization description related history text in the recent organization description related history text set to obtain a corresponding word segmentation sequence includes: performing word segmentation processing on each recent organization description related historical text in the recent organization description related historical text set based on a preset word segmentation dictionary to obtain a corresponding word segmentation sequence; and the above method further comprises: and adding each binary concatenation word determined as a new organization descriptor in the binary concatenation word library into the preset word segmentation dictionary.
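The dictionary update in this optional embodiment can be sketched as follows. This is a minimal illustration that assumes the preset word segmentation dictionary is held as an in-memory set of strings; the function name `update_dictionary` and the `flags` mapping are hypothetical.

```python
from typing import Dict, Set


def update_dictionary(dictionary: Set[str], flags: Dict[str, bool]) -> Set[str]:
    """Return a new dictionary that also contains every recognized word.

    `flags` maps each binary concatenated word to True if it was determined
    to be a new organization descriptor, and to False otherwise.
    """
    new_words = {word for word, is_new in flags.items() if is_new}
    return dictionary | new_words
```

Under this assumption, a later segmentation run over the same kind of text would treat the newly added words as single segmented words.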
In some optional embodiments, the preset organization discovery duration is predetermined by the following duration determination step: for each candidate duration in a preset candidate duration set, performing the following recognition accuracy determination operation: acquiring a historical text set that was generated within the candidate duration and is related to describing organizations, together with a corresponding labeled new organization descriptor set; performing word segmentation processing on each historical text in the acquired historical text set to obtain a corresponding word segmentation sequence, and generating a word segmentation sequence set corresponding to the candidate duration by using each word segmentation sequence obtained after the word segmentation processing; generating a binary concatenated word library corresponding to the candidate duration by using the binary concatenated words formed by two adjacent segmented words in the word segmentation sequence set corresponding to the candidate duration; for each binary concatenated word in the binary concatenated word library corresponding to the candidate duration, calculating the word frequency, the degree of freedom and the degree of solidification of the binary concatenated word based on the word segmentation sequence set corresponding to the candidate duration, and determining the binary concatenated word as a correctly recognized word in response to determining that the binary concatenated word satisfies each condition in the preset new word discovery condition set and belongs to the labeled new organization descriptor set, or in response to determining that the binary concatenated word fails at least one condition in the preset new word discovery condition set and does not belong to the labeled new organization descriptor set; determining the ratio of the number of correctly recognized words in the binary concatenated word library corresponding to the candidate duration to the number of binary concatenated words in that library as the recognition accuracy corresponding to the candidate duration; and determining the candidate duration with the highest recognition accuracy in the preset candidate duration set as the preset organization discovery duration.
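The duration determination step above amounts to computing an accuracy per candidate duration and keeping the best candidate. The sketch below illustrates that selection only; the `evaluate` callback, which is assumed to return the per-word recognition decisions and the labeled descriptor set for a given duration, is hypothetical, as are the function names.

```python
from typing import Callable, Dict, Iterable, Set, Tuple


def recognition_accuracy(predicted: Dict[str, bool], labeled: Set[str]) -> float:
    """Share of binary concatenated words whose decision agrees with the labels.

    A word counts as correct if it is predicted new and is labeled, or if it
    is predicted not new and is not labeled.
    """
    if not predicted:
        return 0.0
    correct = sum(1 for word, is_new in predicted.items() if is_new == (word in labeled))
    return correct / len(predicted)


def choose_duration(
    candidates: Iterable[int],
    evaluate: Callable[[int], Tuple[Dict[str, bool], Set[str]]],
) -> int:
    """Return the candidate duration with the highest recognition accuracy."""
    return max(candidates, key=lambda d: recognition_accuracy(*evaluate(d)))
```

If several candidates tie, `max` keeps the first one encountered; the disclosure does not say how ties should be resolved.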
In some optional embodiments, for each binary concatenated word in the binary concatenated word library, calculating the word frequency, the degree of freedom and the degree of solidification of the binary concatenated word based on the target word segmentation sequence set includes: for each binary concatenated word $x$ in the binary concatenated word library, formed by concatenating segmented word $x_1$ and segmented word $x_2$, executing the following calculation operation: counting the word frequency $P(x)$ of the binary concatenated word $x$ in the target word segmentation sequence set, the word frequency $P(x_1)$ of the segmented word $x_1$ in the target word segmentation sequence set, and the word frequency $P(x_2)$ of the segmented word $x_2$ in the target word segmentation sequence set; calculating the degree of solidification $\mathrm{Agglomeration}(x)$ of the binary concatenated word $x$ according to the following formula:
[formula not reproduced in the source text]
generating a preceding adjacent word set $\mathrm{Pre}_x$ corresponding to the binary concatenated word $x$ by using each segmented word that is located before and adjacent to the binary concatenated word $x$ in each word segmentation sequence of the target word segmentation sequence set; counting the word frequency $P(y)$, in the target word segmentation sequence set, of each word $y$ in the preceding adjacent word set $\mathrm{Pre}_x$; generating a following adjacent word set $\mathrm{Post}_x$ corresponding to the binary concatenated word $x$ by using each segmented word that is located after and adjacent to the binary concatenated word $x$ in each word segmentation sequence of the target word segmentation sequence set; counting the word frequency $P(z)$, in the target word segmentation sequence set, of each word $z$ in the following adjacent word set $\mathrm{Post}_x$; and calculating the degree of freedom $\mathrm{Free}(x)$ of the binary concatenated word $x$ according to the following formula:
$$\mathrm{Free}(x)=\min\bigl(H(\mathrm{Pre}_x),\,H(\mathrm{Post}_x)\bigr)$$
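The statistics described above can be sketched in Python as follows. The formulas for the degree of solidification and for H are not reproduced in this text, so the sketch rests on stated assumptions rather than on the disclosure: word frequencies are taken as relative frequencies, the degree of solidification is assumed to be the ratio P(x) / (P(x1) P(x2)), and H is assumed to be an entropy-style sum of -P(y) log P(y) over the words of an adjacent word set. All function names are hypothetical.

```python
import math
from collections import Counter


def corpus_statistics(sequences):
    """Count single words, adjacent pairs, and each pair's neighbor sets."""
    unigram, bigram = Counter(), Counter()
    pre, post = {}, {}  # pair -> set of words directly before / after it
    for seq in sequences:
        unigram.update(seq)
        for i in range(len(seq) - 1):
            x = (seq[i], seq[i + 1])
            bigram[x] += 1
            if i > 0:
                pre.setdefault(x, set()).add(seq[i - 1])
            if i + 2 < len(seq):
                post.setdefault(x, set()).add(seq[i + 2])
    return unigram, bigram, pre, post


def agglomeration(x, unigram, bigram):
    """Assumed form: P(x) / (P(x1) * P(x2)) with relative frequencies."""
    n_uni, n_bi = sum(unigram.values()), sum(bigram.values())
    p_x = bigram[x] / n_bi
    return p_x / ((unigram[x[0]] / n_uni) * (unigram[x[1]] / n_uni))


def neighbor_entropy(neighbors, unigram):
    """Assumed form of H: -sum of P(y) * log P(y) over the neighbor set."""
    if not neighbors:
        return 0.0
    total = sum(unigram.values())
    return -sum((unigram[y] / total) * math.log(unigram[y] / total) for y in neighbors)


def freedom(x, unigram, pre, post):
    """Free(x) = min(H(Pre_x), H(Post_x))."""
    return min(
        neighbor_entropy(pre.get(x, set()), unigram),
        neighbor_entropy(post.get(x, set()), unigram),
    )
```

Under these assumptions a pair that only ever appears at the start or end of a sequence has an empty neighbor set on one side and therefore a degree of freedom of zero.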
In a second aspect, the present disclosure provides a new organization descriptor recognition apparatus, the apparatus comprising: an acquisition unit configured to acquire a recent organization description related historical text set, wherein the recent organization description related historical text set is a historical text set which is generated within a most recent preset organization discovery duration and is related to describing organizations; a first generation unit configured to perform word segmentation processing on each recent organization description related historical text in the recent organization description related historical text set to obtain a corresponding word segmentation sequence, and generate a target word segmentation sequence set by using each word segmentation sequence obtained after the word segmentation processing; a second generation unit configured to generate a binary concatenated word library by using binary concatenated words formed by two adjacent segmented words in a target word segmentation sequence in the target word segmentation sequence set; and a recognition unit configured to execute the following recognition operation on each binary concatenated word in the binary concatenated word library: calculating the word frequency, the degree of freedom and the degree of solidification of the binary concatenated word based on the target word segmentation sequence set, and determining the binary concatenated word as a new organization descriptor in response to determining that the binary concatenated word meets each condition in a preset new word discovery condition set, wherein the preset new word discovery condition set comprises at least one of the following conditions: the word frequency of the binary concatenated word is greater than a preset word frequency threshold, the degree of solidification of the binary concatenated word is greater than a preset degree of solidification threshold, and the degree of freedom of the binary concatenated word is greater than a preset degree of freedom threshold.
In some optional embodiments, the performing word segmentation processing on each recent organization description related history text in the recent organization description related history text set to obtain a corresponding word segmentation sequence includes: performing word segmentation processing on each recent organization description related historical text in the recent organization description related historical text set based on a preset word segmentation dictionary to obtain a corresponding word segmentation sequence; and the above apparatus further comprises: and the adding unit is configured to add each binary spliced word determined as the new organization descriptor in the binary spliced word library into the preset word segmentation dictionary.
In some optional embodiments, the preset organization discovery duration is predetermined by the following duration determination step: for each candidate duration in a preset candidate duration set, performing the following recognition accuracy determination operation: acquiring a historical text set that was generated within the candidate duration and is related to describing organizations, together with a corresponding labeled new organization descriptor set; performing word segmentation processing on each historical text in the acquired historical text set to obtain a corresponding word segmentation sequence, and generating a word segmentation sequence set corresponding to the candidate duration by using each word segmentation sequence obtained after the word segmentation processing; generating a binary concatenated word library corresponding to the candidate duration by using the binary concatenated words formed by two adjacent segmented words in the word segmentation sequence set corresponding to the candidate duration; for each binary concatenated word in the binary concatenated word library corresponding to the candidate duration, calculating the word frequency, the degree of freedom and the degree of solidification of the binary concatenated word based on the word segmentation sequence set corresponding to the candidate duration, and determining the binary concatenated word as a correctly recognized word in response to determining that the binary concatenated word satisfies each condition in the preset new word discovery condition set and belongs to the labeled new organization descriptor set, or in response to determining that the binary concatenated word fails at least one condition in the preset new word discovery condition set and does not belong to the labeled new organization descriptor set; determining the ratio of the number of correctly recognized words in the binary concatenated word library corresponding to the candidate duration to the number of binary concatenated words in that library as the recognition accuracy corresponding to the candidate duration; and determining the candidate duration with the highest recognition accuracy in the preset candidate duration set as the preset organization discovery duration.
In some optional embodiments, for each binary concatenated word in the binary concatenated word library, calculating the word frequency, the degree of freedom and the degree of solidification of the binary concatenated word based on the target word segmentation sequence set includes: for each binary concatenated word $x$ in the binary concatenated word library, formed by concatenating segmented word $x_1$ and segmented word $x_2$, executing the following calculation operation: counting the word frequency $P(x)$ of the binary concatenated word $x$ in the target word segmentation sequence set, the word frequency $P(x_1)$ of the segmented word $x_1$ in the target word segmentation sequence set, and the word frequency $P(x_2)$ of the segmented word $x_2$ in the target word segmentation sequence set; calculating the degree of solidification $\mathrm{Agglomeration}(x)$ of the binary concatenated word $x$ according to the following formula:
[formula not reproduced in the source text]
generating a preceding adjacent word set $\mathrm{Pre}_x$ corresponding to the binary concatenated word $x$ by using each segmented word that is located before and adjacent to the binary concatenated word $x$ in each word segmentation sequence of the target word segmentation sequence set; counting the word frequency $P(y)$, in the target word segmentation sequence set, of each word $y$ in the preceding adjacent word set $\mathrm{Pre}_x$; generating a following adjacent word set $\mathrm{Post}_x$ corresponding to the binary concatenated word $x$ by using each segmented word that is located after and adjacent to the binary concatenated word $x$ in each word segmentation sequence of the target word segmentation sequence set; counting the word frequency $P(z)$, in the target word segmentation sequence set, of each word $z$ in the following adjacent word set $\mathrm{Post}_x$; and calculating the degree of freedom $\mathrm{Free}(x)$ of the binary concatenated word $x$ according to the following formula:
$$\mathrm{Free}(x)=\min\bigl(H(\mathrm{Pre}_x),\,H(\mathrm{Post}_x)\bigr)$$
In a third aspect, the present disclosure provides an electronic device, comprising: one or more processors; and a storage device on which one or more programs are stored, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described in any implementation of the first aspect.
In a fourth aspect, the present disclosure provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by one or more processors, implements the method as described in any of the implementations of the first aspect.
To identify new organization descriptors in recently generated text, the applicant found through research that if a binary concatenated word occurs frequently in recent organization description related historical texts, it is highly likely to be a new word used to describe an organization. Based on this finding, the new organization descriptor recognition method and apparatus provided by the present disclosure first acquire a recent organization description related historical text set, that is, a set of historical texts generated within the most recent preset organization discovery duration and related to describing organizations. Next, word segmentation processing is performed on each recent organization description related historical text in the set to obtain a corresponding word segmentation sequence, and a target word segmentation sequence set is generated from the word segmentation sequences so obtained. A binary concatenated word library is then generated from the binary concatenated words formed by two adjacent segmented words in the target word segmentation sequence set. Finally, for each binary concatenated word in the library, the word frequency, the degree of freedom and the degree of solidification are calculated based on the target word segmentation sequence set, and the binary concatenated word is determined to be a new organization descriptor in response to determining that it meets each condition in a preset new word discovery condition set. The whole process requires no manual operation, which reduces the labor and time cost of discovering new organization descriptors and allows them to be identified quickly from a large amount of recently generated text.
Drawings
Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;
FIG. 2 is a flow diagram of one embodiment of a new organizational descriptor recognition method according to the present disclosure;
FIG. 3 is a flow chart of one embodiment of a duration determination step according to the present disclosure;
FIG. 4 is a flow diagram of yet another embodiment of a new organizational descriptor recognition method in accordance with the present disclosure;
FIG. 5 is a schematic diagram of the structure of one embodiment of a new organization descriptor recognition apparatus according to the present disclosure;
FIG. 6 is a schematic block diagram of a computer system suitable for use in implementing the electronic device of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the new organization descriptor recognition method or apparatus of the present disclosure may be applied.
As shown in fig. 1, system architecture 100 may include terminal device 101, network 102, and server 103. Network 102 is the medium used to provide communication links between terminal devices 101 and server 103. Network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use terminal device 101 to interact with server 103 over network 102 to receive or send messages and the like. Various communication client applications, such as a text record application, a new organization descriptor recognition application, a web browser application, etc., may be installed on the terminal device 101.
The terminal device 101 may be hardware or software. When the terminal device 101 is hardware, it may be any of various electronic devices that have a display screen and support text input, including but not limited to smartphones, tablet computers, laptop computers, desktop computers, and the like. When the terminal device 101 is software, it may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (e.g., to provide a new organization descriptor recognition service for text) or as a single piece of software or software module. No specific limitation is imposed here.
The server 103 may be a server that provides various services, such as a background server that provides a new organization descriptor recognition service for text sent by the terminal device 101. The background server may analyze the received text, and feed back a processing result (e.g., new organization descriptor) to the terminal device.
In some cases, the new organization descriptor recognition method provided by the present disclosure may be performed by both the terminal device 101 and the server 103, for example, the step of "obtaining a recent organization description related history text set" may be performed by the terminal device 101, and the remaining steps may be performed by the server 103. The present disclosure is not limited thereto. Accordingly, the new organization descriptor recognition means may be provided in the terminal device 101 and the server 103, respectively.
In some cases, the new organization descriptor recognition method provided by the present disclosure may be executed by the server 103, and accordingly, a new organization descriptor recognition apparatus may also be disposed in the server 103, and in this case, the system architecture 100 may not include the terminal device 101.
In some cases, the new organization descriptor recognition method provided by the present disclosure may be executed by the terminal device 101, and accordingly, the new organization descriptor recognition apparatus may also be disposed in the terminal device 101, and in this case, the system architecture 100 may not include the server 103.
The server 103 may be hardware or software. When the server 103 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or as a single server. When the server 103 is software, it may be implemented as multiple pieces of software or software modules (for example, to provide a new organization descriptor recognition service), or as a single piece of software or software module. No specific limitation is imposed here.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a new organizational descriptor recognition method in accordance with the present disclosure is shown. The new organization descriptor recognition method comprises the following steps:
Step 201, acquiring a recent organization description related historical text set.
In this embodiment, the execution subject of the new organization descriptor recognition method (e.g., the server shown in fig. 1) may first acquire a recent organization description related historical text set. Here, the recent organization description related historical text set is a set of historical texts that were generated within the most recent preset organization discovery duration and are related to describing organizations.
Here, the preset organization discovery duration may be set in various ways. For example, it may be a length of time that a technician determines in advance, based on the computing performance parameters of the execution subject and the amount of organization description related text generated per unit of historical time, and stores in the execution subject. For example, the preset organization discovery duration may be 5 days, 150 hours, or the like. It can be understood that the longer the preset organization discovery duration, the larger the amount of data in the acquired recent organization description related historical text set, and accordingly the longer it takes to identify the new organization descriptors in that set, which delays obtaining them. Conversely, if the preset organization discovery duration is too short, the acquired set may contain too little text to yield any new organization descriptor, or the descriptors obtained may not be genuine new organization descriptors. Setting the preset organization discovery duration therefore requires balancing the computation time required against the accuracy of the identified new organization descriptors.
Here, the execution subject may obtain a recent organization description related history text set stored locally, or the execution subject may remotely obtain a recent organization description related history text set from another electronic device (for example, a terminal device shown in fig. 1) connected to the execution subject through a network.
It should be noted that the recent organization description related historical text set acquired here may be the original set of historical texts that were generated within the most recent preset organization discovery duration and are related to describing organizations; it may also be a text set obtained by preprocessing that original set. By way of example, preprocessing may include, but is not limited to, removing invalid characters, converting between full-width and half-width characters, and the like. The invalid characters may be, for example, modal particles, function words, or the like.
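The preprocessing mentioned above might look like the following Python sketch. It is an illustration under assumptions, not the procedure of the disclosure: Unicode NFKC normalization is used as one way to fold full-width characters to half-width (it also applies other compatibility mappings), control characters are treated as invalid, and the filler word list `FILLER_WORDS` is a hypothetical stand-in for whatever words an implementation chooses to remove.

```python
import unicodedata

# Hypothetical filler words; the disclosure does not list concrete entries.
FILLER_WORDS = ("啊", "呢", "吧")


def preprocess(text, fillers=FILLER_WORDS):
    """Fold full-width characters, drop control characters and filler words."""
    # NFKC maps full-width letters, digits and punctuation to half-width forms.
    text = unicodedata.normalize("NFKC", text)
    # Unicode categories starting with "C" cover control and format characters.
    text = "".join(ch for ch in text if unicodedata.category(ch)[0] != "C")
    for filler in fillers:
        text = text.replace(filler, "")
    return text.strip()
```

For instance, `preprocess("ＡＢＣ１２３吧")` returns `"ABC123"` under these assumptions.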
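Step 202, performing word segmentation processing on each recent organization description related historical text in the recent organization description related historical text set to obtain a corresponding word segmentation sequence, and generating a target word segmentation sequence set by using each word segmentation sequence obtained after the word segmentation processing.
In this embodiment, the execution subject may perform word segmentation processing on each recent organization description related historical text in the recent organization description related historical text set acquired in step 201 to obtain a corresponding word segmentation sequence, and may then generate a target word segmentation sequence set from the word segmentation sequences obtained after the word segmentation processing.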
It should be noted that word segmentation of text is prior art that has been widely studied and applied in this field, and is not described in detail here. For example, a word segmentation method based on string matching, a word segmentation method based on understanding, or a word segmentation method based on statistics may be employed. For example, word segmentation for the historical text "there is a crowd reflecting a first cell and a certain rental house has suspicious people in and out often" can result in a word segmentation sequence "there is a crowd | reflecting a | first | cell | a | rental | house | has | suspicious people | in and out often".
Step 203, generating a binary concatenated word library from the binary concatenated words formed by two adjacent segmented words in the target word segmentation sequence set.
In this embodiment, the execution main body may generate a binary concatenated word library by using a binary concatenated word composed of two adjacent segmented words in a target segmented word sequence in the target segmented word sequence set.
For example, assume that the target word segmentation sequence set is { "rent | house | often | has | suspicious | person | come in and go out", "two | partnery | don't | identity | person | in | parking lot | because | parking | problem | fight" }. The binary concatenated word library obtained through step 203 then consists of every pair of adjacent segmented words in these sequences: { "rent house", "house often", "often has", "has suspicious", "suspicious person", "person come in and go out", "two partnery", "partnery don't", "don't identity", "identity person", "person in", "in parking lot", "parking lot because", "because parking", "parking problem", "problem fight" }.
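Step 203 can be illustrated with a short sketch. It assumes each word segmentation sequence is a list of strings and represents a binary concatenated word as a pair, so that the boundary between the two segmented words is kept; the name `build_bigram_library` is hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Bigram = Tuple[str, str]


def build_bigram_library(sequences: Iterable[List[str]]) -> Set[Bigram]:
    """Collect every pair of adjacent segmented words across all sequences."""
    library: Set[Bigram] = set()
    for seq in sequences:
        # zip(seq, seq[1:]) yields (seq[0], seq[1]), (seq[1], seq[2]), ...
        for left, right in zip(seq, seq[1:]):
            library.add((left, right))
    return library
```

Pairs never span two different sequences, since each sequence is processed on its own.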
Step 204, performing a recognition operation on each binary concatenated word in the binary concatenated word library.
In this embodiment, the execution subject may execute the recognition operation for each binary concatenated word in the binary concatenated word library generated in step 203. Specifically, the identifying operation may include sub-step 2041 and sub-step 2042.
Sub-step 2041, calculating the word frequency, the degree of freedom and the degree of solidification of the binary concatenated word based on the target word segmentation sequence set.
In this embodiment, the execution subject may adopt various implementations to calculate the word frequency, the degree of freedom and the degree of solidification of the binary concatenated word based on the target word segmentation sequence set.
The word frequency of the binary concatenated word characterizes how frequently the binary concatenated word occurs in the target word segmentation sequence set. The more frequently the binary concatenated word occurs in the target word segmentation sequence set, the higher the probability that it is a new organization descriptor.
In some optional implementation manners, calculating the word frequency of the binary concatenated word based on the target word segmentation sequence set may be to count a sum of occurrence times of the binary concatenated word in each target word segmentation sequence of the target word segmentation sequence set, and determine the sum of the occurrence times obtained through the counting as the word frequency of the binary concatenated word.
In some optional implementations, calculating the word frequency of the binary concatenated word based on the target word segmentation sequence set may also be performed as follows: firstly, counting the sum of the occurrence times of the binary concatenated word in each target word segmentation sequence of the target word segmentation sequence set, and then determining the word frequency of the binary concatenated word by the ratio obtained by dividing the counted sum of the occurrence times by the sum of the total occurrence times of the segmented words corresponding to the target word segmentation sequence set. Here, the sum of the total occurrence times of the participles corresponding to the target participle sequence set is the sum of the occurrence times of each participle in each target participle sequence in the target participle sequence set.
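The two word frequency variants described above (a raw count, or the count divided by the total number of segmented word occurrences) can be sketched as follows. The function names and the `relative` flag are hypothetical; sequences are assumed to be lists of strings and a binary concatenated word a pair of strings.

```python
from typing import List, Sequence, Tuple


def count_bigram(bigram: Tuple[str, str], sequences: Sequence[List[str]]) -> int:
    """Total number of times the two words occur adjacently, in order."""
    return sum(
        1
        for seq in sequences
        for pair in zip(seq, seq[1:])
        if pair == tuple(bigram)
    )


def word_frequency(bigram, sequences, relative=False):
    """Raw count, or count divided by the total number of word occurrences."""
    n = count_bigram(bigram, sequences)
    if not relative:
        return n
    total = sum(len(seq) for seq in sequences)  # N in the description
    return n / total if total else 0.0
```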
The degree of solidification of the binary concatenated word characterizes how fixedly the two segmented words included in the binary concatenated word are combined in the target word segmentation sequences. The more fixedly the two segmented words are combined in the target word segmentation sequence set, the higher the probability that the binary concatenated word is a new organization descriptor.
Assume that the binary concatenated word library is X, and that each binary concatenated word x in X is formed by concatenating a segmented word $x_1$ and a segmented word $x_2$, that is, $x = x_1 x_2$. Assume further that the word frequency of the binary concatenated word x in the target word segmentation sequence set is $P(x)$.
In some optional implementations, the degree of solidification Agglomeration(x) of the binary concatenated word x may be calculated based on the target word segmentation sequence set as follows:
First, the word frequency $P(x_1)$ of the segmented word $x_1$ in the target word segmentation sequence set and the word frequency $P(x_2)$ of the segmented word $x_2$ in the target word segmentation sequence set may be determined. It should be noted that $P(x_1)$ and $P(x_2)$ may be determined by the same method as that used to determine the word frequency $P(x)$ of the binary concatenated word x in the target word segmentation sequence set.
Then, the degree of solidification Agglomeration(x) of the binary concatenated word x may be calculated according to the following formula:
$$\mathrm{Agglomeration}(x) = \frac{P(x)}{P(x_1)\,P(x_2)} \quad \text{(formula 1)}$$
Suppose that the binary concatenated word x, the segmented word $x_1$ and the segmented word $x_2$ occur $n$, $n_1$ and $n_2$ times, respectively, across the target word segmentation sequences of the target word segmentation sequence set, and that the total number of occurrences of segmented words corresponding to the target word segmentation sequence set is $N$, where $N$ is a positive integer. Then $P(x)$, $P(x_1)$ and $P(x_2)$ may be $n$, $n_1$ and $n_2$, respectively, or may be $n/N$, $n_1/N$ and $n_2/N$, respectively.
As can be seen from the above formula, when $P(x)$, $P(x_1)$ and $P(x_2)$ are $n$, $n_1$ and $n_2$, respectively, the degree of solidification Agglomeration(x) of the binary concatenated word x can be expressed as follows:
$$\mathrm{Agglomeration}(x) = \frac{n}{n_1\,n_2} \quad \text{(formula 2)}$$
When $P(x)$, $P(x_1)$ and $P(x_2)$ are $n/N$, $n_1/N$ and $n_2/N$, respectively, the degree of solidification Agglomeration(x) of the binary concatenated word x can be expressed as follows:
$$\mathrm{Agglomeration}(x) = \frac{n/N}{(n_1/N)\,(n_2/N)} = \frac{n\,N}{n_1\,n_2} \quad \text{(formula 3)}$$
As can be seen from formulas 2 and 3, the degree of solidification Agglomeration(x) of the binary concatenated word x is inversely proportional to the number of occurrences $n_1$ of the segmented word $x_1$ and to the number of occurrences $n_2$ of the segmented word $x_2$ in the target word segmentation sequence set, and directly proportional to the number of occurrences $n$ of the binary concatenated word x in the target word segmentation sequence set. Wherein:
Agglomeration(x) reaches its upper limit when $n_1$, $n_2$ and $n$ are equal. If the word frequency is calculated as in formula 2, Agglomeration(x) is then $1/n$; accordingly, if the word frequency is calculated as in formula 3, Agglomeration(x) is $N/n$. In this case, wherever the segmented word $x_1$ occurs in the target word segmentation sequence set it occurs together with the segmented word $x_2$, and wherever $x_2$ occurs it occurs together with $x_1$; neither $x_1$ nor $x_2$ occurs alone. This indicates that the probability of $x_1 x_2$ being used in combination as one word is high.
Conversely, Agglomeration(x) reaches its lower limit when $n$ is 1 and $n_1$ and/or $n_2$ is greater than 1. If the word frequency is calculated as in formula 2, Agglomeration(x) is then $1/(n_1 n_2)$; accordingly, if the word frequency is calculated as in formula 3, Agglomeration(x) is $N/(n_1 n_2)$. In this case, the segmented word $x_1$ occurs together with the segmented word $x_2$ only once in the target word segmentation sequence set, and in all other cases $x_1$ or $x_2$ occurs alone. This indicates that the probability of $x_1 x_2$ being used in combination as one word is low.
It can be understood that other methods may also be adopted to calculate the degree of solidification Agglomeration(x) of the binary concatenated word x based on the target word segmentation sequence set, as long as Agglomeration(x) is negatively correlated with the number of occurrences $n_1$ of the segmented word $x_1$ and the number of occurrences $n_2$ of the segmented word $x_2$ in the target word segmentation sequence set, and positively correlated with the number of occurrences $n$ of the binary concatenated word x in the target word segmentation sequence set. For example, the degree of solidification Agglomeration(x) of the binary concatenated word x can be calculated by the following formula 4 or formula 5:
$$\mathrm{Agglomeration}(x) = P(x_1) + P(x_2) - P(x_1 x_2) \quad \text{(formula 5)}$$
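As an illustration of the degree of solidification, the sketch below assumes the ratio form P(x) / (P(x1) · P(x2)), which is one form consistent with the correlations stated above (decreasing in n1 and n2, increasing in n). The function name `agglomeration` and the `relative` flag, which switches between raw counts and counts divided by N, are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def agglomeration(bigram: Tuple[str, str],
                  sequences: Sequence[List[str]],
                  relative: bool = True) -> float:
    """Assumed ratio form: P(x) / (P(x1) * P(x2))."""
    unigram = Counter(tok for seq in sequences for tok in seq)
    n = sum(1 for seq in sequences
            for pair in zip(seq, seq[1:]) if pair == tuple(bigram))
    n1, n2 = unigram[bigram[0]], unigram[bigram[1]]
    if n1 == 0 or n2 == 0:
        return 0.0                      # a word never seen cannot form the pair
    if not relative:
        return n / (n1 * n2)            # frequencies taken as raw counts
    total = sum(unigram.values())       # N: total segmented word occurrences
    return (n / total) / ((n1 / total) * (n2 / total))
```

With raw counts the value lies between 0 and 1; with relative frequencies it is scaled by N, so thresholds are not interchangeable between the two variants.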
The degree of freedom of the binary concatenated word characterizes how freely the binary concatenated word, taken as a whole, combines with other segmented words in the target word segmentation sequences. That is, if the words preceding and following the binary concatenated word are relatively fixed, the degree of freedom of the binary concatenated word as a whole is considered low, and the binary concatenated word may not be a new organization descriptor. Conversely, if the words preceding and following the binary concatenated word vary widely, its degree of freedom is considered high, and the binary concatenated word may be a new organization descriptor that can be freely combined with the surrounding words.
Here, continuing with the above definitions of X, $x_1$, $x_2$, $P(x)$, $P(x_1)$ and $P(x_2)$, in some optional implementations the degree of freedom Free(x) of the binary concatenated word x may be calculated based on the target word segmentation sequence set as follows:
First, generating the preceding adjacent word set $\mathrm{Pre}_x$ corresponding to the binary concatenated word x from the segmented words that are located before and adjacent to the binary concatenated word x in the word segmentation sequences of the target word segmentation sequence set.
Second, counting the word frequency $P(y)$, in the target word segmentation sequence set, of each word y in the preceding adjacent word set $\mathrm{Pre}_x$.
Third, generating the subsequent adjacent word set $\mathrm{Post}_x$ corresponding to the binary concatenated word x from the segmented words that are located after and adjacent to the binary concatenated word x in the word segmentation sequences of the target word segmentation sequence set.
Fourth, counting the word frequency $P(z)$, in the target word segmentation sequence set, of each word z in the subsequent adjacent word set $\mathrm{Post}_x$.
Fifth, calculating the degree of freedom Free(x) of the binary concatenated word x according to the following formula.
$$\mathrm{Free}(x) = \min\big(H(\mathrm{Pre}_x),\ H(\mathrm{Post}_x)\big) \quad \text{(formula 8)}$$
As can be seen from the above description and from formulas 6, 7 and 8, $H(\mathrm{Pre}_x)$ is computed over the preceding adjacent word set $\mathrm{Pre}_x$ corresponding to the binary concatenated word x and reflects the degree of variation of $\mathrm{Pre}_x$; it can also be understood as the degree of freedom of the segmented words before the binary concatenated word x. Likewise, $H(\mathrm{Post}_x)$ is computed over the subsequent adjacent word set $\mathrm{Post}_x$ corresponding to the binary concatenated word x and reflects the degree of variation of $\mathrm{Post}_x$; it can also be understood as the degree of freedom of the segmented words after the binary concatenated word x. The degree of freedom Free(x) of the binary concatenated word x is the smaller of $H(\mathrm{Pre}_x)$ and $H(\mathrm{Post}_x)$, that is, it is determined by whichever of the preceding adjacent word set and the subsequent adjacent word set varies less. The larger the degree of freedom Free(x) of the binary concatenated word x is, the more the words before and after the binary concatenated word x vary, that is, the more freely the binary concatenated word x combines with other words, and the higher the probability that the binary concatenated word x is a new organization descriptor.
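The five steps above can be sketched as follows. Two assumptions are made here because formulas 6 and 7 are not reproduced in the text: H is taken to be the Shannon entropy (natural logarithm), and the probability of each neighbor is normalized over the occurrences of neighbors adjacent to x. The names `entropy` and `degree_of_freedom` are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def entropy(counts: Counter) -> float:
    """Shannon entropy (natural log) of a neighbor-count distribution."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def degree_of_freedom(bigram: Tuple[str, str],
                      sequences: Sequence[List[str]]) -> float:
    """Free(x) = min(H(Pre_x), H(Post_x)) under the assumptions stated above."""
    left, right = bigram
    pre, post = Counter(), Counter()
    for seq in sequences:
        for i in range(len(seq) - 1):
            if seq[i] == left and seq[i + 1] == right:
                if i > 0:
                    pre[seq[i - 1]] += 1      # word just before the bigram
                if i + 2 < len(seq):
                    post[seq[i + 2]] += 1     # word just after the bigram
    return min(entropy(pre), entropy(post))
```

A bigram that always has the same neighbor on one side gets a degree of freedom of zero, regardless of how varied the other side is.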
Here, the execution subject may determine whether the binary concatenated word satisfies each condition in a preset new word discovery condition group and, if so, determine the binary concatenated word as a new organization descriptor. The preset new word discovery condition group may include at least one of the following conditions: the word frequency of the binary concatenated word is greater than a preset word frequency threshold, the degree of solidification of the binary concatenated word is greater than a preset degree of solidification threshold, and the degree of freedom of the binary concatenated word is greater than a preset degree of freedom threshold.
Continuing with the notation of sub-step 2041, let $T_p$, $T_a$ and $T_f$ be the preset word frequency threshold, the preset degree of solidification threshold and the preset degree of freedom threshold, respectively. The preset new word discovery condition group may then include at least one of the following conditions:
Condition 1: $P(x) > T_p$;
Condition 2: $\mathrm{Agglomeration}(x) > T_a$;
Condition 3: $\mathrm{Free}(x) > T_f$.
In practice, the preset word frequency threshold, the preset degree of solidification threshold and the preset degree of freedom threshold may be set manually by a technician according to experience and stored in the execution subject.
As can be seen from the description of sub-step 2041, if the binary concatenated word x satisfies each condition in the preset new word discovery condition group, the probability that x is a new organization descriptor is high, and the binary concatenated word may be determined as a new organization descriptor.
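The check against the preset new word discovery condition group can be sketched as below. The `Thresholds` class, the condition names, and the default of using all three conditions are hypothetical choices for illustration; the text only requires that the group contain at least one of the three conditions.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Thresholds:
    t_p: float  # preset word frequency threshold
    t_a: float  # preset degree of solidification threshold
    t_f: float  # preset degree of freedom threshold


def is_new_descriptor(p: float, agg: float, free: float, th: Thresholds,
                      conditions: Iterable[str] = ("frequency",
                                                   "solidification",
                                                   "freedom")) -> bool:
    """True only if every condition in the configured group is satisfied."""
    checks = {
        "frequency": p > th.t_p,
        "solidification": agg > th.t_a,
        "freedom": free > th.t_f,
    }
    selected = list(conditions)
    if not selected:
        raise ValueError("the condition group must contain at least one condition")
    return all(checks[name] for name in selected)
```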
In some optional implementations, the preset organization discovery duration described in step 201 may be predetermined by the duration determination step shown in fig. 3. Referring to fig. 3, fig. 3 shows a flow 300 of one embodiment of the duration determination step according to the present disclosure. The duration determination step comprises the following steps:
Here, the execution subject of the duration determination step may be the same as the execution subject of the above-described new organization descriptor recognition method. In this case, after determining the preset organization discovery duration, the execution subject of the duration determination step may store it locally and read it while executing the new organization descriptor recognition method.
Here, the execution subject of the duration determination step may also be different from the execution subject of the above-described new organization descriptor recognition method. In this case, after determining the preset organization discovery duration, the execution subject of the duration determination step may send it to the execution subject of the new organization descriptor recognition method, which may then read the received preset organization discovery duration while executing the new organization descriptor recognition method.
Here, the preset candidate duration set may be a set consisting of at least one candidate duration. The time units of the candidate durations may be the same or different. For example, the time unit of the candidate duration may be day, hour, or both day and hour. As an example, the preset candidate duration set may be {1 day, 2 days, 3 days, 4 days, 5 days, 6 days, 7 days }.
Here, the execution subject of the duration determination step may execute the recognition accuracy determination operation for each candidate duration in the preset candidate duration set, and specifically, the recognition accuracy determination operation may include sub-steps 3011 to 3015:
Sub-step 3011, acquiring the history text set related to describing organizations that was generated within the most recent candidate duration, and the corresponding labeled new organization descriptor set.
In practice, the labeled new organization descriptor set, which consists of descriptors for new organizations that have not appeared historically, may be manually annotated from the history text set related to describing organizations generated within the candidate duration.
Here, assuming that the candidate duration is 3 days in the preset candidate duration set of the above example, sub-step 3011 acquires the history text set related to describing organizations generated within the last 3 days and the corresponding labeled new organization descriptor set.
Sub-step 3012, performing word segmentation processing on each history text in the acquired history text set to obtain a corresponding word segmentation sequence, and generating a word segmentation sequence set corresponding to the candidate duration from the word segmentation sequences obtained after the word segmentation processing.
Here, for how to segment the text to obtain the word segmentation sequences, reference may be made to the related description in step 202, which is not repeated here.
Sub-step 3013, generating a binary concatenated word library corresponding to the candidate duration from the binary concatenated words formed by two adjacent segmented words in the word segmentation sequence set corresponding to the candidate duration.
Sub-step 3014, for each binary concatenated word in the binary concatenated word library corresponding to the candidate duration, calculating the word frequency, the degree of freedom and the degree of solidification of the binary concatenated word based on the word segmentation sequence set corresponding to the candidate duration, and determining the binary concatenated word as a correctly recognized word either in response to determining that it satisfies each condition in the preset new word discovery condition group and belongs to the labeled new organization descriptor set, or in response to determining that it fails at least one condition in the preset new word discovery condition group and does not belong to the labeled new organization descriptor set.
Here, for each binary concatenated word in the binary concatenated word library corresponding to the candidate duration generated in sub-step 3013, the execution subject of the duration determination step may determine the binary concatenated word as a correctly recognized word either in response to determining that it satisfies each condition in the preset new word discovery condition group and belongs to the labeled new organization descriptor set, or in response to determining that it fails at least one condition in the preset new word discovery condition group and does not belong to the labeled new organization descriptor set. That is, if the binary concatenated word is a new organization descriptor according to the preset new word discovery condition group and is also a new organization descriptor according to the labeled new organization descriptor set obtained in sub-step 3011, the recognition based on the preset new word discovery condition group is considered correct, and the binary concatenated word may be determined as a correctly recognized word. Similarly, if the binary concatenated word is not a new organization descriptor according to the preset new word discovery condition group and is not a new organization descriptor according to the labeled new organization descriptor set either, the recognition is also considered correct, and the binary concatenated word may be determined as a correctly recognized word. Conversely, if the binary concatenated word is a new organization descriptor according to the preset new word discovery condition group but is not a new organization descriptor according to the labeled new organization descriptor set, the recognition is considered incorrect, and the binary concatenated word may be determined as an incorrectly recognized word. Likewise, if the binary concatenated word is not a new organization descriptor according to the preset new word discovery condition group but is a new organization descriptor according to the labeled new organization descriptor set, the recognition is also considered incorrect, and the binary concatenated word may be determined as an incorrectly recognized word.
Sub-step 3015, determining the ratio of the number of correctly recognized words in the binary concatenated word library corresponding to the candidate duration to the number of binary concatenated words in that library as the recognition accuracy corresponding to the candidate duration.
Since sub-step 3014 has determined, for each binary concatenated word in the binary concatenated word library corresponding to the candidate duration, whether it is a correctly recognized word, sub-step 3015 may divide the number of correctly recognized words in that library by the number of binary concatenated words in that library and determine the resulting ratio as the recognition accuracy corresponding to the candidate duration.
After step 301, the recognition accuracy corresponding to each candidate duration in the preset candidate duration set has been determined. Here, the candidate duration with the highest recognition accuracy in the preset candidate duration set may be determined as the preset organization discovery duration.
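The accuracy computation of sub-steps 3014 and 3015 and the final selection of the duration can be sketched as follows. The sketch assumes the set of words accepted by the condition group (`predicted`) and the labeled set (`labeled`) are already available for each candidate duration; all function and parameter names are hypothetical.

```python
from typing import Dict, Hashable, Set


def recognition_accuracy(library: Set[Hashable],
                         predicted: Set[Hashable],
                         labeled: Set[Hashable]) -> float:
    """Share of library words where the condition group and the labels agree."""
    if not library:
        return 0.0
    # A word is correctly recognized when both say "new" or both say "not new".
    correct = sum(1 for w in library if (w in predicted) == (w in labeled))
    return correct / len(library)


def choose_duration(accuracy_by_duration: Dict[str, float]) -> str:
    """Pick the candidate duration with the highest recognition accuracy."""
    return max(accuracy_by_duration, key=accuracy_by_duration.get)
```

If several candidate durations tie, `max` returns the first one in dictionary order of insertion; a real implementation might prefer the shortest tied duration to reduce computation.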
When the recent organization description related history text set is acquired during execution of the new organization descriptor recognition method, the preset organization discovery duration determined by the duration determination step shown in fig. 3 determines which history texts related to describing organizations are acquired. Because this duration is the candidate duration with the highest recognition accuracy in the preset candidate duration set, there is no need to acquire history texts generated over a longer period in order to improve the recognition accuracy. This reduces the amount of computation, so that both computational efficiency and recognition quality are taken into account.
The method provided by the above embodiment of the present disclosure first acquires a recent organization description related history text set, that is, a history text set related to describing organizations generated within the most recent preset organization discovery duration. It then generates a binary concatenated word library corresponding to the recent organization description related history text set. Finally, for each binary concatenated word in the binary concatenated word library, it calculates the word frequency, the degree of freedom and the degree of solidification of the binary concatenated word based on the target word segmentation sequence set and, if the binary concatenated word is determined to satisfy each condition in a preset new word discovery condition group, determines the binary concatenated word as a new organization descriptor. With this method, the whole process of recognizing new organization descriptors requires no manual operation, which reduces the labor cost and time cost of discovering new organization descriptors.
With further reference to fig. 4, a flow 400 of yet another embodiment of a new organizational descriptor recognition method is shown. The process 400 of the new organization descriptor recognition method includes the following steps:
In this embodiment, the specific operation and the technical effect of step 401 are substantially the same as those of step 201 in the embodiment shown in fig. 2, and are not repeated herein.
In this embodiment, the executing body of the new organization descriptor recognition method may adopt a dictionary-based word segmentation method, perform word segmentation processing on each recent organization description related history text in the recent organization description related history text set acquired in step 401 based on a preset word segmentation dictionary to obtain a corresponding word segmentation sequence, and generate a target word segmentation sequence set by using each word segmentation sequence obtained after the word segmentation processing.
In practice, the dictionary-based word segmentation method may include a forward maximum matching method, a reverse maximum matching method, and a bidirectional matching word segmentation method according to different scanning directions. The dictionary-based word segmentation method may refer to matching a word string to be analyzed (e.g., each recent organization description related history text in the recent organization description related history text set in step 402) with entries in a preset word segmentation dictionary according to a certain policy, segmenting the word string into words if the word string exists in the dictionary, and then performing matching of a next word string.
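Of the dictionary-based methods named above, forward maximum matching is the simplest to illustrate. The sketch below is a generic version of that technique, not the disclosed implementation; the function name `forward_max_match` and the fallback of emitting a single character when no dictionary entry matches are assumptions.

```python
from typing import List, Optional, Set


def forward_max_match(text: str, dictionary: Set[str],
                      max_len: Optional[int] = None) -> List[str]:
    """Scan left to right, always taking the longest dictionary entry."""
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    max_len = max(max_len, 1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens
```

Reverse maximum matching follows the same idea but scans from the end of the string, and bidirectional matching compares the two results.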
Step 403, generating a binary concatenated word library from the binary concatenated words formed by two adjacent segmented words in the target word segmentation sequence set.
Step 404, performing a recognition operation on each binary concatenated word in the binary concatenated word library.
In this embodiment, the specific operations of step 403 and step 404 and the technical effects thereof are substantially the same as the operations and effects of step 203 and step 204 in the embodiment shown in fig. 2, and are not repeated herein.
Step 405, adding each binary concatenated word in the binary concatenated word library that is determined as a new organization descriptor to the preset word segmentation dictionary.
In this embodiment, the execution subject may add each binary concatenated word determined as the new organization descriptor in step 404 in the binary concatenated word library generated in step 403 to the preset word segmentation dictionary. Therefore, when the new organization descriptor recognition method is executed again next time, the new organization descriptor recognized this time is already added into the preset word segmentation dictionary, namely the preset word segmentation dictionary is updated, and the new organization descriptor recognized this time will not be recognized as a new organization descriptor next time.
It should be noted that the preset word segmentation dictionary may be obtained by gradually adding new organization descriptors on the basis of the general word segmentation dictionary.
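The dictionary update of step 405 can be sketched as follows, assuming the preset word segmentation dictionary is held as a set of strings and each recognized binary concatenated word is a pair of segmented words. The name `add_new_descriptors` and the `joiner` parameter (empty for Chinese text) are hypothetical.

```python
from typing import Iterable, Set, Tuple


def add_new_descriptors(dictionary: Set[str],
                        new_bigrams: Iterable[Tuple[str, str]],
                        joiner: str = "") -> Set[str]:
    """Join each recognized pair into one word and add it to the dictionary.

    Returns the words that were actually added in this call.
    """
    added: Set[str] = set()
    for left, right in new_bigrams:
        word = f"{left}{joiner}{right}"
        if word not in dictionary:
            dictionary.add(word)
            added.add(word)
    return added
```

Once the joined word is in the dictionary, a dictionary-based segmenter can emit it as a single segmented word, so the same pair no longer appears as a candidate in later runs.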
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the flow 400 of the new organization descriptor recognition method in this embodiment additionally includes the step of updating the preset word segmentation dictionary. The scheme described in this embodiment can therefore update the preset word segmentation dictionary in real time, so that a word once recognized as a new organization descriptor, having already been added to the preset word segmentation dictionary, will not be recognized as a new organization descriptor again in later executions.
With further reference to fig. 5, as an implementation of the method shown in the above figures, the present disclosure provides an embodiment of a new organization descriptor recognition apparatus, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 5, the new organization descriptor recognition apparatus 500 of the present embodiment includes: an acquisition unit 501, a first generation unit 502, a second generation unit 503, and a recognition unit 504. The acquisition unit 501 is configured to acquire a recent organization description related history text set, where the recent organization description related history text set is a history text set related to describing organizations generated within the most recent preset organization discovery duration; the first generation unit 502 is configured to perform word segmentation processing on each recent organization description related history text in the recent organization description related history text set to obtain a corresponding word segmentation sequence, and to generate a target word segmentation sequence set from the word segmentation sequences obtained after the word segmentation processing; the second generation unit 503 is configured to generate a binary concatenated word library from the binary concatenated words formed by two adjacent segmented words in the target word segmentation sequence set; the recognition unit 504 is configured to perform the following recognition operation for each binary concatenated word in the binary concatenated word library: calculating the word frequency, the degree of freedom and the degree of solidification of the binary concatenated word based on the target word segmentation sequence set, and determining the binary concatenated word as a new organization descriptor in response to determining that the binary concatenated word satisfies each condition in a preset new word discovery condition group, wherein the preset new word discovery condition group includes at least one of the following conditions: the word frequency of the binary concatenated word is greater than a preset word frequency threshold, the degree of solidification of the binary concatenated word is greater than a preset degree of solidification threshold, and the degree of freedom of the binary concatenated word is greater than a preset degree of freedom threshold.
In this embodiment, for the specific processing of the acquisition unit 501, the first generation unit 502, the second generation unit 503 and the recognition unit 504 of the new organization descriptor recognition apparatus 500, and the technical effects thereof, reference may be made to the related descriptions of step 201, step 202, step 203 and step 204 in the embodiment corresponding to fig. 2, which are not repeated here.
In some optional embodiments, performing word segmentation processing on each recent organization description related history text in the recent organization description related history text set to obtain a corresponding word segmentation sequence may include: performing word segmentation processing on each recent organization description related history text in the recent organization description related history text set based on a preset word segmentation dictionary to obtain the corresponding word segmentation sequence. The apparatus 500 may further include an adding unit 505 configured to add each binary spliced word in the binary spliced word library that is determined to be a new organization descriptor to the preset word segmentation dictionary.
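The effect of adding recognized words back into the word segmentation dictionary can be sketched as follows. The disclosure does not specify the segmentation algorithm, so this sketch swaps in plain forward maximum matching as a stand-in; `segment`, `add_new_descriptors`, and the toy dictionary are hypothetical, and joining the two segmented words without a separator is an assumption that suits unspaced text.

```python
from typing import Iterable, List, Set, Tuple


def segment(text: str, dictionary: Set[str]) -> List[str]:
    """Forward maximum matching: at each position take the longest
    dictionary word, falling back to a single character."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def add_new_descriptors(dictionary: Set[str],
                        new_words: Iterable[Tuple[str, str]]) -> None:
    """Add each recognized binary spliced word to the dictionary in place."""
    for first, second in new_words:
        dictionary.add(first + second)


dictionary = {"blue", "sky", "team"}
before = segment("blueskyteam", dictionary)   # ["blue", "sky", "team"]
add_new_descriptors(dictionary, [("blue", "sky")])
after = segment("blueskyteam", dictionary)    # ["bluesky", "team"]
```

Once the spliced word is in the dictionary, later segmentation passes keep it as a single token, so it can in turn become one half of a longer binary spliced word.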
In some optional embodiments, the preset organization discovery duration may be predetermined by the following duration determination steps: for each candidate duration in a preset candidate duration set, performing the following recognition accuracy determination operations: acquiring a set of history texts that were generated within the candidate duration and that relate to describing organizations, together with a corresponding labeled new organization descriptor set; performing word segmentation processing on each history text in the acquired history text set to obtain a corresponding word segmentation sequence, and generating a word segmentation sequence set corresponding to the candidate duration by using each word segmentation sequence obtained after the word segmentation processing; generating a binary spliced word library corresponding to the candidate duration by using the binary spliced words each composed of two adjacent segmented words in the word segmentation sequence set corresponding to the candidate duration; for each binary spliced word in the binary spliced word library corresponding to the candidate duration, calculating the word frequency, the degree of freedom and the degree of solidification of the binary spliced word based on the word segmentation sequence set corresponding to the candidate duration, and, in response to determining that the binary spliced word satisfies each condition in the preset new word discovery condition set and belongs to the labeled new organization descriptor set, or in response to determining that the binary spliced word fails to satisfy at least one condition in the preset new word discovery condition set and does not belong to the labeled new organization descriptor set, determining the binary spliced word to be a correctly recognized word; determining the ratio of the number of correctly recognized words in the binary spliced word library corresponding to the candidate duration to the number of binary spliced words in that library as the recognition accuracy corresponding to the candidate duration; and determining the candidate duration with the highest recognition accuracy in the preset candidate duration set as the preset organization discovery duration.
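The accuracy-based choice of the discovery duration can be sketched in a few lines of Python. The names `recognition_accuracy` and `choose_duration`, the day-based duration keys, and the toy sets are assumptions made for illustration. A word counts as correctly recognized when the recognizer's verdict agrees with the label, which covers both cases described above (accepted and labeled, or rejected and unlabeled).

```python
from typing import Dict, Hashable, Set


def recognition_accuracy(library: Set[Hashable],
                         recognized: Set[Hashable],
                         labeled: Set[Hashable]) -> float:
    """Share of library entries whose recognition verdict matches the label."""
    if not library:
        return 0.0
    correct = sum((word in recognized) == (word in labeled) for word in library)
    return correct / len(library)


def choose_duration(accuracy_by_duration: Dict[int, float]) -> int:
    """Pick the candidate duration with the highest recognition accuracy."""
    return max(accuracy_by_duration, key=accuracy_by_duration.get)


# Toy example: four candidates, two accepted by the recognizer, two labeled.
library = {"a", "b", "c", "d"}
accuracy = recognition_accuracy(library, recognized={"a", "b"}, labeled={"a", "c"})

# Hypothetical accuracies for candidate durations of 7, 30 and 90 days.
best = choose_duration({7: 0.5, 30: 0.75, 90: 0.6})
```

In the toy example, "a" (accepted and labeled) and "d" (rejected and unlabeled) are correct, so the accuracy is 0.5, and the 30-day candidate is chosen. When two durations tie, `max` returns the first one encountered; the disclosure does not say how ties should be broken.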
In some optional embodiments, calculating, for each binary spliced word in the binary spliced word library, the word frequency, the degree of freedom and the degree of solidification of the binary spliced word based on the target word segmentation sequence set may include: for each binary spliced word $x$ in the binary spliced word library, formed by splicing a segmented word $x_1$ and a segmented word $x_2$, performing the following calculation operations: counting the word frequency $P(x)$ of the binary spliced word $x$ in the target word segmentation sequence set, the word frequency $P(x_1)$ of the segmented word $x_1$ in the target word segmentation sequence set, and the word frequency $P(x_2)$ of the segmented word $x_2$ in the target word segmentation sequence set; calculating the degree of solidification $\mathrm{Agglomeration}(x)$ of the binary spliced word $x$ according to the following formula: [formula not reproduced in the source text]
generating a preceding adjacent word set $\mathrm{Pre}_x$ corresponding to the binary spliced word $x$ by using each segmented word that is located before and adjacent to the binary spliced word $x$ in each word segmentation sequence of the target word segmentation sequence set; counting the word frequency $P(y)$ of each word $y$ of the preceding adjacent word set $\mathrm{Pre}_x$ in the target word segmentation sequence set; generating a following adjacent word set $\mathrm{Post}_x$ corresponding to the binary spliced word $x$ by using each segmented word that is located after and adjacent to the binary spliced word $x$ in each word segmentation sequence of the target word segmentation sequence set; counting the word frequency $P(z)$ of each word $z$ of the following adjacent word set $\mathrm{Post}_x$ in the target word segmentation sequence set; and calculating the degree of freedom $\mathrm{Free}(x)$ of the binary spliced word $x$ according to the following formula:
$$\mathrm{Free}(x) = \min\big(H(\mathrm{Pre}_x),\ H(\mathrm{Post}_x)\big)$$
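As a rough illustration of these calculations, the sketch below computes the three quantities for one binary spliced word. Several points are assumptions rather than statements of the disclosure: the solidification formula is not available in the text, so the commonly used ratio P(x) / (P(x1) * P(x2)) is substituted; H is taken to be the Shannon entropy of the neighbouring-word distribution; and word frequency is normalised by the total number of segmented words. The names `entropy` and `bigram_metrics` are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def entropy(counts: Counter) -> float:
    """Shannon entropy (natural log) of a count distribution."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def bigram_metrics(sequences: List[List[str]],
                   x1: str, x2: str) -> Tuple[float, float, float]:
    """Return (word frequency, solidification, freedom) for the pair x1 + x2."""
    words = Counter(w for seq in sequences for w in seq)
    total_words = sum(words.values())
    if total_words == 0:
        return 0.0, 0.0, 0.0
    pre, post = Counter(), Counter()
    hits = 0
    for seq in sequences:
        for i in range(len(seq) - 1):
            if seq[i] == x1 and seq[i + 1] == x2:
                hits += 1
                if i > 0:
                    pre[seq[i - 1]] += 1      # word just before the pair
                if i + 2 < len(seq):
                    post[seq[i + 2]] += 1     # word just after the pair
    p_x = hits / total_words
    p_x1 = words[x1] / total_words
    p_x2 = words[x2] / total_words
    # Assumed solidification formula: P(x) / (P(x1) * P(x2)).
    solidification = p_x / (p_x1 * p_x2) if hits else 0.0
    # Free(x) = min(H(Pre_x), H(Post_x)), with H assumed to be entropy.
    freedom = min(entropy(pre), entropy(post))
    return p_x, solidification, freedom


sequences = [["a", "red", "cross", "b"], ["c", "red", "cross", "d"]]
freq, solid, free = bigram_metrics(sequences, "red", "cross")
```

For the toy input, the pair occurs twice among eight segmented words, so `freq` is 0.25, `solid` is 0.25 / (0.25 * 0.25) = 4.0, and `free` is ln 2, because each side sees two distinct neighbours once each. Taking the minimum of the two entropies means that a pair with a fixed neighbour on either side scores low, which is why a high value suggests the pair behaves as an independent word.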
It should be noted that, for implementation details and technical effects of each unit in the new organization descriptor recognition apparatus provided in the present disclosure, reference may be made to the descriptions of other embodiments in the present disclosure, which are not repeated herein.
Referring now to FIG. 6, a block diagram of a computer system 600 suitable for use in implementing the electronic device of the present disclosure is shown. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the present disclosure.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU) 601, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. The RAM 603 also stores various programs and data necessary for the operation of the system 600. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An Input/Output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input section 606 including a touch screen, a tablet, a keyboard, a mouse, or the like; an output section 607 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, a speaker, and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a Network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from the network through the communication section 609. The above-described functions defined in the method of the present disclosure are performed when the computer program is executed by a Central Processing Unit (CPU) 601. It should be noted that the computer readable medium in the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. 
In contrast, in the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk, C++, and Python, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in this disclosure may be implemented by software or hardware. The described units may also be provided in a processor, which may be described as: a processor including an acquisition unit, a first generation unit, a second generation unit, and a recognition unit. The names of these units do not, in some cases, constitute a limitation on the units themselves; for example, the acquisition unit may also be described as a "unit that acquires a recent organization description related history text set".
As another aspect, the present disclosure also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: acquire a recent organization description related history text set, wherein the recent organization description related history text set is a set of history texts that were generated within the most recent preset organization discovery duration and that relate to describing organizations; perform word segmentation processing on each recent organization description related history text in the recent organization description related history text set to obtain a corresponding word segmentation sequence, and generate a target word segmentation sequence set by using each word segmentation sequence obtained after the word segmentation processing; generate a binary spliced word library by using the binary spliced words each composed of two adjacent segmented words in a target word segmentation sequence in the target word segmentation sequence set; and, for each binary spliced word in the binary spliced word library, perform the following recognition operation: calculating the word frequency, the degree of freedom and the degree of solidification of the binary spliced word based on the target word segmentation sequence set, and determining the binary spliced word as a new organization descriptor in response to determining that the binary spliced word meets each condition in a preset new word discovery condition set, wherein the preset new word discovery condition set comprises at least one of the following conditions: the word frequency of the binary spliced word is larger than a preset word frequency threshold value, the degree of solidification of the binary spliced word is larger than a preset degree of solidification threshold value, and the degree of freedom of the binary spliced word is larger than a preset degree of freedom threshold value.
The foregoing describes only preferred embodiments of the disclosure and illustrates the technical principles employed. Those skilled in the art will appreciate that the scope of the invention in the present disclosure is not limited to technical solutions formed by the specific combination of the above features, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in the present disclosure.
Claims (10)
1. A new organization descriptor recognition method, comprising:
acquiring a recent organization description related historical text set, wherein the recent organization description related historical text set is a set of historical alarm receiving and handling texts that were generated within the most recent preset organization discovery duration and that relate to describing organizations;
performing word segmentation processing on each recent organization description related historical text in the recent organization description related historical text set to obtain a corresponding word segmentation sequence, and generating a target word segmentation sequence set by using each word segmentation sequence obtained after the word segmentation processing;
generating a binary spliced word library by using the binary spliced words each composed of two adjacent segmented words in a target word segmentation sequence in the target word segmentation sequence set;
for each binary spliced word in the binary spliced word library, performing the following recognition operation: calculating the word frequency, the degree of freedom and the degree of solidification of the binary spliced word based on the target word segmentation sequence set, and determining the binary spliced word as a new organization descriptor in response to determining that the binary spliced word meets each condition in a preset new word discovery condition set, wherein the preset new word discovery condition set comprises at least one of the following conditions: the word frequency of the binary spliced word is larger than a preset word frequency threshold value, the degree of solidification of the binary spliced word is larger than a preset degree of solidification threshold value, and the degree of freedom of the binary spliced word is larger than a preset degree of freedom threshold value.
2. The method of claim 1, wherein the performing word segmentation processing on each recent organization description related history text in the recent organization description related history text set to obtain a corresponding word segmentation sequence comprises:
performing word segmentation processing on each recent organization description related historical text in the recent organization description related historical text set based on a preset word segmentation dictionary to obtain a corresponding word segmentation sequence; and
the method further comprises:
adding each binary spliced word in the binary spliced word library that is determined to be a new organization descriptor to the preset word segmentation dictionary.
3. The method according to claim 1 or 2, wherein the preset organization discovery duration is predetermined by the following duration determination steps:
for each candidate duration in a preset candidate duration set, performing the following recognition accuracy determination operations: acquiring a set of historical texts that were generated within the candidate duration and that relate to describing organizations, together with a corresponding labeled new organization descriptor set; performing word segmentation processing on each historical text in the acquired historical text set to obtain a corresponding word segmentation sequence, and generating a word segmentation sequence set corresponding to the candidate duration by using each word segmentation sequence obtained after the word segmentation processing; generating a binary spliced word library corresponding to the candidate duration by using the binary spliced words each composed of two adjacent segmented words in the word segmentation sequence set corresponding to the candidate duration; for each binary spliced word in the binary spliced word library corresponding to the candidate duration, calculating the word frequency, the degree of freedom and the degree of solidification of the binary spliced word based on the word segmentation sequence set corresponding to the candidate duration, and, in response to determining that the binary spliced word satisfies each condition in the preset new word discovery condition set and belongs to the labeled new organization descriptor set, or in response to determining that the binary spliced word fails to satisfy at least one condition in the preset new word discovery condition set and does not belong to the labeled new organization descriptor set, determining the binary spliced word to be a correctly recognized word; determining the ratio of the number of correctly recognized words in the binary spliced word library corresponding to the candidate duration to the number of binary spliced words in that library as the recognition accuracy corresponding to the candidate duration;
and determining the candidate duration with the highest recognition accuracy in the preset candidate duration set as the preset organization discovery duration.
4. The method of claim 3, wherein the calculating, for each binary spliced word in the binary spliced word library, the word frequency, the degree of freedom and the degree of solidification of the binary spliced word based on the target word segmentation sequence set comprises:
for each binary spliced word $x$ in the binary spliced word library, formed by splicing a segmented word $x_1$ and a segmented word $x_2$, performing the following calculation operations:
counting the word frequency $P(x)$ of the binary spliced word $x$ in the target word segmentation sequence set, the word frequency $P(x_1)$ of the segmented word $x_1$ in the target word segmentation sequence set, and the word frequency $P(x_2)$ of the segmented word $x_2$ in the target word segmentation sequence set;
calculating the degree of solidification $\mathrm{Agglomeration}(x)$ of the binary spliced word $x$ according to the following formula: [formula not reproduced in the source text]
generating a preceding adjacent word set $\mathrm{Pre}_x$ corresponding to the binary spliced word $x$ by using each segmented word that is located before and adjacent to the binary spliced word $x$ in each word segmentation sequence of the target word segmentation sequence set;
counting the word frequency $P(y)$ of each word $y$ of the preceding adjacent word set $\mathrm{Pre}_x$ in the target word segmentation sequence set;
generating a following adjacent word set $\mathrm{Post}_x$ corresponding to the binary spliced word $x$ by using each segmented word that is located after and adjacent to the binary spliced word $x$ in each word segmentation sequence of the target word segmentation sequence set;
counting the word frequency $P(z)$ of each word $z$ of the following adjacent word set $\mathrm{Post}_x$ in the target word segmentation sequence set;
calculating the degree of freedom $\mathrm{Free}(x)$ of the binary spliced word $x$ according to the following formula:
$$\mathrm{Free}(x) = \min\big(H(\mathrm{Pre}_x),\ H(\mathrm{Post}_x)\big).$$
5. A new organization descriptor recognition apparatus, comprising:
an acquisition unit configured to acquire a recent organization description related historical text set, wherein the recent organization description related historical text set is a set of historical texts that were generated within the most recent preset organization discovery duration and that relate to describing organizations;
a first generation unit configured to perform word segmentation processing on each recent organization description related historical text in the recent organization description related historical text set to obtain a corresponding word segmentation sequence, and to generate a target word segmentation sequence set by using each word segmentation sequence obtained after the word segmentation processing;
a second generation unit configured to generate a binary spliced word library by using the binary spliced words each composed of two adjacent segmented words in a target word segmentation sequence in the target word segmentation sequence set;
a recognition unit configured to perform the following recognition operation for each binary spliced word in the binary spliced word library: calculating the word frequency, the degree of freedom and the degree of solidification of the binary spliced word based on the target word segmentation sequence set, and determining the binary spliced word as a new organization descriptor in response to determining that the binary spliced word meets each condition in a preset new word discovery condition set, wherein the preset new word discovery condition set comprises at least one of the following conditions: the word frequency of the binary spliced word is larger than a preset word frequency threshold value, the degree of solidification of the binary spliced word is larger than a preset degree of solidification threshold value, and the degree of freedom of the binary spliced word is larger than a preset degree of freedom threshold value.
6. The apparatus of claim 5, wherein the performing word segmentation processing on each recent organization description related history text in the recent organization description related history text set to obtain a corresponding word segmentation sequence comprises:
performing word segmentation processing on each recent organization description related historical text in the recent organization description related historical text set based on a preset word segmentation dictionary to obtain a corresponding word segmentation sequence; and
the apparatus further comprises:
an adding unit configured to add each binary spliced word in the binary spliced word library that is determined to be a new organization descriptor to the preset word segmentation dictionary.
7. The apparatus according to claim 5 or 6, wherein the preset organization discovery duration is predetermined by the following duration determination steps:
for each candidate duration in a preset candidate duration set, performing the following recognition accuracy determination operations: acquiring a set of historical texts that were generated within the candidate duration and that relate to describing organizations, together with a corresponding labeled new organization descriptor set; performing word segmentation processing on each historical text in the acquired historical text set to obtain a corresponding word segmentation sequence, and generating a word segmentation sequence set corresponding to the candidate duration by using each word segmentation sequence obtained after the word segmentation processing; generating a binary spliced word library corresponding to the candidate duration by using the binary spliced words each composed of two adjacent segmented words in the word segmentation sequence set corresponding to the candidate duration; for each binary spliced word in the binary spliced word library corresponding to the candidate duration, calculating the word frequency, the degree of freedom and the degree of solidification of the binary spliced word based on the word segmentation sequence set corresponding to the candidate duration, and, in response to determining that the binary spliced word satisfies each condition in the preset new word discovery condition set and belongs to the labeled new organization descriptor set, or in response to determining that the binary spliced word fails to satisfy at least one condition in the preset new word discovery condition set and does not belong to the labeled new organization descriptor set, determining the binary spliced word to be a correctly recognized word; determining the ratio of the number of correctly recognized words in the binary spliced word library corresponding to the candidate duration to the number of binary spliced words in that library as the recognition accuracy corresponding to the candidate duration;
and determining the candidate duration with the highest recognition accuracy in the preset candidate duration set as the preset organization discovery duration.
8. The apparatus of claim 7, wherein the calculating, for each binary spliced word in the binary spliced word library, the word frequency, the degree of freedom and the degree of solidification of the binary spliced word based on the target word segmentation sequence set comprises:
for each binary spliced word $x$ in the binary spliced word library, formed by splicing a segmented word $x_1$ and a segmented word $x_2$, performing the following calculation operations:
counting the word frequency $P(x)$ of the binary spliced word $x$ in the target word segmentation sequence set, the word frequency $P(x_1)$ of the segmented word $x_1$ in the target word segmentation sequence set, and the word frequency $P(x_2)$ of the segmented word $x_2$ in the target word segmentation sequence set;
calculating the degree of solidification $\mathrm{Agglomeration}(x)$ of the binary spliced word $x$ according to the following formula: [formula not reproduced in the source text]
generating a preceding adjacent word set $\mathrm{Pre}_x$ corresponding to the binary spliced word $x$ by using each segmented word that is located before and adjacent to the binary spliced word $x$ in each word segmentation sequence of the target word segmentation sequence set;
counting the word frequency $P(y)$ of each word $y$ of the preceding adjacent word set $\mathrm{Pre}_x$ in the target word segmentation sequence set;
generating a following adjacent word set $\mathrm{Post}_x$ corresponding to the binary spliced word $x$ by using each segmented word that is located after and adjacent to the binary spliced word $x$ in each word segmentation sequence of the target word segmentation sequence set;
counting the word frequency $P(z)$ of each word $z$ of the following adjacent word set $\mathrm{Post}_x$ in the target word segmentation sequence set;
calculating the degree of freedom $\mathrm{Free}(x)$ of the binary spliced word $x$ according to the following formula:
$$\mathrm{Free}(x) = \min\big(H(\mathrm{Pre}_x),\ H(\mathrm{Post}_x)\big).$$
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method recited in any of claims 1-4.
10. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010435003.3A CN112329458B (en) | 2020-05-21 | 2020-05-21 | New organization descriptor recognition method and device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010435003.3A CN112329458B (en) | 2020-05-21 | 2020-05-21 | New organization descriptor recognition method and device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112329458A (en) | 2021-02-05 |
CN112329458B (en) | 2024-05-10 |
Family
ID=74302841
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010435003.3A Active CN112329458B (en) | 2020-05-21 | 2020-05-21 | New organization descriptor recognition method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112329458B (en) |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103020251A (en) * | 2012-12-20 | 2013-04-03 | 人民搜索网络股份公司 | Automatic mining system and method of news events in large-scale data |
US20150079554A1 (en) * | 2012-05-17 | 2015-03-19 | Postech Academy-Industry Foundation | Language learning system and learning method |
CN104951428A (en) * | 2014-03-26 | 2015-09-30 | 阿里巴巴集团控股有限公司 | User intention recognition method and device |
CN105389349A (en) * | 2015-10-27 | 2016-03-09 | 上海智臻智能网络科技股份有限公司 | Dictionary updating method and apparatus |
US20160098383A1 (en) * | 2014-10-07 | 2016-04-07 | International Business Machines Corporation | Implicit Durations Calculation and Similarity Comparison in Question Answering Systems |
CN106776542A (en) * | 2016-11-23 | 2017-05-31 | 北京小米移动软件有限公司 | The crucial word treatment method of field feedback, device and server |
CN108038119A (en) * | 2017-11-01 | 2018-05-15 | 平安科技(深圳)有限公司 | Utilize the method, apparatus and storage medium of new word discovery investment target |
CN108319582A (en) * | 2017-12-29 | 2018-07-24 | 北京城市网邻信息技术有限公司 | Processing method, device and the server of text message |
CN109408818A (en) * | 2018-10-12 | 2019-03-01 | 平安科技(深圳)有限公司 | New word identification method, device, computer equipment and storage medium |
CN109614499A (en) * | 2018-11-22 | 2019-04-12 | 阿里巴巴集团控股有限公司 | A kind of dictionary generating method, new word discovery method, apparatus and electronic equipment |
CN110457595A (en) * | 2019-08-01 | 2019-11-15 | 腾讯科技(深圳)有限公司 | Emergency event alarm method, device, system, electronic equipment and storage medium |
CN111147905A (en) * | 2019-12-31 | 2020-05-12 | 深圳Tcl数字技术有限公司 | Media resource searching method, television, storage medium and device |
CN111159557A (en) * | 2019-12-31 | 2020-05-15 | 北京奇艺世纪科技有限公司 | Hotspot information acquisition method, device, server and medium |
2020-05-21: CN application CN202010435003.3A (CN112329458B), status Active
Non-Patent Citations (1)
Title |
---|
Jie Fei: "Burst detection of implicit events in social networks" (社交网络中隐式事件突发性检测), Acta Automatica Sinica (自动化学报), vol. 44, no. 04, 11 December 2017 (2017-12-11), pages 730-742 * |
Also Published As
Publication number | Publication date |
---|---|
CN112329458B (en) | 2024-05-10 |
Similar Documents
Publication | Title |
---|---|
US10755048B2 (en) | Artificial intelligence based method and apparatus for segmenting sentence |
CN114612759B (en) | Video processing method, video query method, model training method and model training device |
CN111061877A (en) | Text theme extraction method and device |
CN111368551A (en) | Method and device for determining event subject |
CN111626054B (en) | Novel illegal action descriptor recognition method and device, electronic equipment and storage medium |
CN111078849A (en) | Method and apparatus for outputting information |
CN113590756A (en) | Information sequence generation method and device, terminal equipment and computer readable medium |
CN110675865B (en) | Method and apparatus for training hybrid language recognition models |
CN113111233A (en) | Regular expression-based method and device for extracting residential address of alarm receiving and processing text |
CN113111167A (en) | Method and device for extracting vehicle model of alarm receiving and processing text based on deep learning model |
CN112329458B (en) | New organization descriptor recognition method and device, electronic equipment and storage medium |
WO2022148239A1 (en) | Method and apparatus for information output, and electronic device |
CN108628909B (en) | Information pushing method and device |
CN113111230B (en) | Regular expression-based alarm receiving text home address extraction method and device |
CN115098729A (en) | Video processing method, sample generation method, model training method and device |
CN111666449B (en) | Video retrieval method, apparatus, electronic device, and computer-readable medium |
CN112131874A (en) | New group descriptor recognition method and device, electronic device and storage medium |
CN113111174A (en) | Group identification method, device, equipment and medium based on deep learning model |
CN114490400A (en) | Method and device for processing test cases |
CN114066603A (en) | Post-loan risk early warning method and device, electronic equipment and computer readable medium |
CN113239259A (en) | Method and device for determining similar stores |
CN109308299B (en) | Method and apparatus for searching information |
CN111626053B (en) | New scheme means descriptor recognition method and device, electronic equipment and storage medium |
CN113094499A (en) | Deep learning model-based organization identification method and device, equipment and medium |
CN112650830B (en) | Keyword extraction method and device, electronic equipment and storage medium |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |