CN110309315B - Template file generation method and device, computer readable medium and electronic equipment - Google Patents

Template file generation method and device, computer readable medium and electronic equipment

Info

Publication number
CN110309315B
Authority
CN
China
Prior art keywords
entity
corpus data
preset
names
template file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810367499.8A
Other languages
Chinese (zh)
Other versions
CN110309315A (en)
Inventor
周辉阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201810367499.8A priority Critical patent/CN110309315B/en
Publication of CN110309315A publication Critical patent/CN110309315A/en
Application granted granted Critical
Publication of CN110309315B publication Critical patent/CN110309315B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367: Ontology
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

An embodiment of the invention provides a template file generation method, a template file generation device, a computer readable medium and electronic equipment. The generation method comprises the following steps: detecting a preset entity name contained in corpus data; determining a target entity label corresponding to the preset entity name according to the correspondence between entity names and entity labels; and replacing the preset entity name contained in the corpus data with the target entity label to generate a template file of the corpus data. If a plurality of preset entity names with overlapping characters exist in the corpus data, the corresponding entity names in the corpus data are replaced with the target entity labels corresponding to each of the preset entity names, so that a plurality of template files of the corpus data are generated. This avoids the problem that, when entity names with overlapping characters appear, a template file is generated for only one of the entity names, which makes the generated template files incomplete and possibly inaccurate.

Description

Template file generation method and device, computer readable medium and electronic equipment
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for generating a template file, a computer readable medium, and an electronic device.
Background
In natural language processing, a good template is very important for the corpus of a given field, because it ensures generalization and usability. However, extracting suitable template files from massive user query data is a difficult problem for which no effective solution currently exists.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the invention and thus may include information that does not form the prior art that is already known to those of ordinary skill in the art.
Disclosure of Invention
Embodiments of the invention provide a template file generation method, a template file generation device, a computer readable medium and electronic equipment, which solve, at least to some extent, the problem that a comprehensive set of template files cannot be obtained in the prior art.
Other features and advantages of the invention will be apparent from the following detailed description, or may be learned by the practice of the invention.
According to an aspect of an embodiment of the present invention, there is provided a method for generating a template file, including: detecting a preset entity name contained in corpus data; determining a target entity label corresponding to the preset entity name according to the corresponding relation between the entity name and the entity label; replacing the preset entity names contained in the corpus data by the target entity labels to generate template files of the corpus data; if a plurality of preset entity names with overlapped characters exist in the corpus data, replacing the corresponding entity names in the corpus data by target entity labels corresponding to the preset entity names respectively to generate a plurality of template files of the corpus data.
According to an aspect of an embodiment of the present invention, there is provided a template file generating apparatus, including: the first detection unit is used for detecting preset entity names contained in the corpus data; the determining unit is used for determining a target entity label corresponding to the preset entity name according to the corresponding relation between the entity name and the entity label; the generating unit is used for replacing the preset entity names contained in the corpus data through the target entity labels so as to generate a template file of the corpus data; the generating unit is further configured to, when a plurality of preset entity names with overlapping characters exist in the corpus data, replace corresponding entity names in the corpus data by target entity tags corresponding to the plurality of preset entity names, so as to generate a plurality of template files of the corpus data.
According to an aspect of the embodiments of the present invention, there is provided a computer-readable medium having stored thereon a computer program which, when executed by a processor, implements a template file generating method as described in the above embodiments.
According to an aspect of an embodiment of the present invention, there is provided an electronic apparatus including: one or more processors; and a storage device for storing one or more programs, which when executed by the one or more processors, cause the one or more processors to implement the method for generating a template file as described in the above embodiments.
In the technical solutions provided in some embodiments of the present invention, a target entity tag corresponding to a preset entity name included in corpus data is determined according to a correspondence between entity names and entity tags, and the preset entity name included in the corpus data is replaced with the target entity tag, so that a template file of the corpus data can be generated by automatic matching. When a plurality of preset entity names with overlapping characters exist in the corpus data, the corresponding entity names in the corpus data are replaced with the target entity tags corresponding to each of the preset entity names, so that a template file is generated for each preset entity name. This avoids the problem that, when preset entity names with overlapping characters exist, a template file is generated for only one of them, which makes the generated template files incomplete and possibly inaccurate.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention. It is evident that the drawings in the following description are only some embodiments of the present invention and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art. In the drawings:
FIG. 1 shows a schematic diagram of an exemplary system architecture of a template file generation method or template file generation apparatus to which embodiments of the present invention may be applied;
FIG. 2 shows a schematic diagram of a computer system suitable for use in implementing an embodiment of the invention;
FIG. 3 schematically illustrates a flow chart of a method of generating a template file according to one embodiment of the invention;
FIG. 4 schematically illustrates a flow chart of a method of generating a template file according to another embodiment of the invention;
FIG. 5 schematically illustrates a flow chart of a method of generating a template file according to yet another embodiment of the invention;
FIG. 6 schematically illustrates a flow chart of a method of generating a template file;
FIG. 7 schematically illustrates a flow chart of a method of generating a template file according to yet another embodiment of the invention;
FIG. 8 schematically illustrates a block diagram of a template file generation apparatus according to one embodiment of the present invention;
FIG. 9 schematically illustrates a block diagram of a template file generation apparatus according to another embodiment of the present invention.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
Fig. 1 shows a schematic diagram of an exemplary system architecture 100 to which a template file generation method or a template file generation apparatus of an embodiment of the present invention may be applied.
As shown in fig. 1, the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired communication links, wireless communication links, and the like.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, the server 105 may be a server cluster formed by a plurality of servers.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. The terminal devices 101, 102, 103 may be various electronic devices with display screens including, but not limited to, smartphones, tablet computers, portable computers, desktop computers, and the like.
The server 105 may be a server providing various services. For example, the server 105 collects the corpus data (such as query sentences) sent by the user using the terminal device 103 (or the terminal device 101 or 102), then detects a preset entity name contained in the corpus data, and further replaces the entity name contained in the corpus data according to an entity tag corresponding to the entity name, so as to generate a template file of the corpus data.
In one embodiment of the present invention, if there are a plurality of preset entity names with overlapping characters in the corpus data (for example, the corpus data "movies of Liu Jiayi" contains "Liu Jiayi" and "Liu Jia", whose characters overlap, and both "Liu Jiayi" and "Liu Jia" are preset entity names), the server 105 may replace the corresponding entity names in the corpus data with the entity labels corresponding to each of the preset entity names, so as to generate a plurality of template files and thereby ensure the comprehensiveness of the generated template files. For example, if the entity tag corresponding to the entity name "Liu Jiayi" is "actor" and the entity tag corresponding to the entity name "Liu Jia" is "director", two template files are generated: "movies of [actor]" and "movies of [director]yi", where the trailing "yi" is the part of "Liu Jiayi" left over after "Liu Jia" is replaced.
In one embodiment of the present invention, after generating a plurality of template files, the server 105 may select an appropriate template file from them as the final template file. In the above example, the server 105 may select the template "movies of [actor]" as the final template file through a corresponding selection policy, which helps to obtain an optimal template file.
It should be noted that, the method for generating the template file according to the embodiment of the present invention is generally executed by the server 105, and accordingly, the generating device of the template file is generally disposed in the server 105. However, in other embodiments of the present invention, the terminal may also have a similar function to the server, so as to execute the template file generation scheme provided in the embodiments of the present invention.
Fig. 2 shows a schematic diagram of a computer system suitable for use in implementing an embodiment of the invention.
It should be noted that, the computer system 200 of the electronic device shown in fig. 2 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present invention.
As shown in fig. 2, the computer system 200 includes a Central Processing Unit (CPU) 201, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 202 or a program loaded from a storage section 208 into a Random Access Memory (RAM) 203. In the RAM 203, various programs and data required for the system operation are also stored. The CPU 201, ROM 202, and RAM 203 are connected to each other through a bus 204. An input/output (I/O) interface 205 is also connected to bus 204.
The following components are connected to the I/O interface 205: an input section 206 including a keyboard, a mouse, and the like; an output portion 207 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker, and the like; a storage section 208 including a hard disk or the like; and a communication section 209 including a network interface card such as a LAN card, a modem, and the like. The communication section 209 performs communication processing via a network such as the internet. The drive 210 is also connected to the I/O interface 205 as needed. A removable medium 211 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed on the drive 210 as needed, so that a computer program read out therefrom is installed into the storage section 208 as needed.
In particular, according to embodiments of the present invention, the processes described below with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present invention include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 209, and/or installed from the removable medium 211. When executed by the Central Processing Unit (CPU) 201, the computer program performs the various functions defined in the system of the present application.
The computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present invention may be implemented by software, or may be implemented by hardware, and the described units may also be provided in a processor. Wherein the names of the units do not constitute a limitation of the units themselves in some cases.
As another aspect, the present application also provides a computer-readable medium that may be contained in the electronic device described in the above embodiment; or may exist alone without being incorporated into the electronic device. The computer-readable medium carries one or more programs which, when executed by one of the electronic devices, cause the electronic device to implement the methods described in the embodiments below. For example, the electronic device may implement the steps shown in fig. 3 to 7.
The implementation details of the technical scheme of the embodiment of the invention are described in detail below:
Fig. 3 schematically shows a flowchart of a method of generating a template file according to an embodiment of the invention, which method is applicable to the electronic device described in the previous embodiment. Referring to fig. 3, the method for generating the template file includes at least steps S310 to S330, which are described in detail as follows:
In step S310, a preset entity name included in the corpus data is detected.
In one embodiment of the present invention, corpus data refers to natural language data actually used by a user in an actual application scenario. An entity is the basic unit representing a concept, and an entity name is the word or phrase that denotes the entity.
In step S320, a target entity tag corresponding to the preset entity name is determined according to the correspondence between the entity name and the entity tag.
In one embodiment of the present invention, the entity tag is used to identify the category to which the entity name belongs, for example, identify that the entity name belongs to "actor" or "director", etc.
In step S330, the preset entity name included in the corpus data is replaced with the target entity tag to generate a template file of the corpus data; if a plurality of preset entity names with overlapping characters exist in the corpus data, the corresponding entity names in the corpus data are replaced with the target entity tags corresponding to each of the preset entity names, so as to generate a plurality of template files of the corpus data.
The technical solution of the embodiment shown in fig. 3 enables template files of corpus data to be generated by automatic matching. When a plurality of preset entity names with overlapping characters exist in the corpus data, the corresponding entity names in the corpus data are replaced with the target entity tags corresponding to each of the preset entity names, so that a template file is generated for each preset entity name. This avoids the problem that, when preset entity names with overlapping characters exist, a template file is generated for only one of them, which makes the generated template files incomplete and possibly inaccurate.
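Steps S310 to S330 can be sketched roughly as follows for the overlapping-name example. This is a minimal illustration, not the patented implementation: the `ENTITY_TAGS` table and the function names are hypothetical, detection uses plain substring search rather than any particular matcher, and every match is turned into its own template, which is only what the method prescribes for names with overlapping characters.

```python
# Hypothetical name-to-tag table; the text only requires that such a
# correspondence exists, not this particular structure.
ENTITY_TAGS = {"Liu Jiayi": "actor", "Liu Jia": "director"}


def find_entities(corpus, entity_tags):
    """S310: return (start, end, name) for every preset name found in corpus."""
    matches = []
    for name in entity_tags:
        start = corpus.find(name)
        while start != -1:
            matches.append((start, start + len(name), name))
            start = corpus.find(name, start + 1)
    return sorted(matches)


def generate_templates(corpus, entity_tags):
    """S320 + S330: build one template per detected (overlapping) name."""
    templates = []
    for start, end, name in find_entities(corpus, entity_tags):
        tag = entity_tags[name]  # S320: look up the target entity tag
        # S330: replace the matched characters with the bracketed tag
        templates.append(corpus[:start] + "[" + tag + "]" + corpus[end:])
    return templates


templates = generate_templates("movies of Liu Jiayi", ENTITY_TAGS)
print(templates)
```

For the example corpus this yields one template in which "Liu Jia" became [director] and one in which "Liu Jiayi" became [actor].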
Implementation details of the template file generation method shown in fig. 3 are described in detail in two embodiments:
In one embodiment of the present invention, detecting a preset entity name included in corpus data in step S310 includes: detecting the position information of the preset entity name in the corpus data.
In one embodiment of the present invention, the position information of the preset entity name in the corpus data may be determined according to the characters contained in the corpus data. For example, for the corpus data "movies of Liu Jiayi and Zhang Bingding", the position information of the entity name "Liu Jiayi" is [0|3]: position 0 is the start position (position 0 itself is excluded) and position 3 is the end position.
In one embodiment of the present invention, after the position information of the preset entity name in the corpus data is detected, the characters at the corresponding positions in the corpus data may be replaced with the target entity tag. In the above example, if the entity tag corresponding to the entity name "Liu Jiayi" is "actor", the characters beginning at position 0 and ending at position 3 in the corpus data "movies of Liu Jiayi and Zhang Bingding" are replaced with "actor".
In one embodiment of the present invention, the position information of the preset entity name in the corpus data and the target entity tag corresponding to the preset entity name may be detected and returned by an AC automaton (Aho-Corasick automaton, a multi-pattern matching algorithm). In the above example, the AC automaton returns the result [0|3|actor], which indicates that the entity tag corresponding to the entity name formed by the characters beginning at position 0 and ending at position 3 is "actor". The technical solution of this embodiment therefore allows both the position information of the preset entity name in the corpus data and the corresponding target entity tag to be detected and returned by a single AC automaton, which improves the matching efficiency of the algorithm.
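A compact Aho-Corasick matcher that returns (start, end, tag) triples, analogous to the [0|3|actor] result above, might look like the sketch below. The function names and the list-based state layout are assumptions made for illustration; the text does not specify how the automaton is built. Offsets here use Python's half-open slicing convention and are counted on the English example string, so they differ from the values quoted in the text.

```python
from collections import deque


def build_automaton(entity_tags):
    """Build trie transitions, failure links and per-state outputs."""
    goto, fail, out = [{}], [0], [[]]
    for name, tag in entity_tags.items():
        state = 0
        for ch in name:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append((len(name), tag))  # a name ends in this state
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]  # inherit shorter matches
    return goto, fail, out


def search(corpus, automaton):
    """Scan corpus once and return (start, end, tag) for every match."""
    goto, fail, out = automaton
    state, results = 0, []
    for i, ch in enumerate(corpus):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for length, tag in out[state]:
            results.append((i + 1 - length, i + 1, tag))
    return results


automaton = build_automaton({"Liu Jiayi": "actor", "Zhang Bingding": "actor"})
matches = search("Liu Jiayi and Zhang Bingding movies", automaton)
print(matches)
```

Because the tag is stored alongside each pattern in the automaton's output table, positions and tags come back from a single pass over the corpus.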
In one embodiment of the present invention, as described above, the position information of the preset entity name in the corpus data is the position of the characters of the preset entity name when the characters of the corpus data are arranged in a first order (e.g., left to right). On this basis, if the corpus data contains a plurality of non-overlapping preset entity names, then when the preset entity names are replaced with the target entity tags, the non-overlapping preset entity names may be replaced one after another in a second order, opposite to the first order, of their positions in the corpus data.
In this embodiment, take the corpus data "movies of Liu Jiayi and Zhang Bingding" as an example, and suppose the AC automaton returns the following result: [0|3|actor], [4|7|actor]. If positions 0-3 in the corpus data (in left-to-right order, excluding position 0) are replaced with "actor" first, "movies of [actor] and Zhang Bingding" is obtained; the position of the entity name "Zhang Bingding" then changes and is no longer the original 4-7 (excluding position 4), so continuing the replacement yields a wrong template file. With the technical solution of this embodiment, the entity names "Liu Jiayi" and "Zhang Bingding" are instead replaced in right-to-left order: positions 4-7 in the corpus data are first replaced with "actor" to obtain "movies of Liu Jiayi and [actor]", and positions 0-3 are then replaced with "actor" to obtain "movies of [actor] and [actor]". Therefore, according to the technical solution of the first embodiment, when a plurality of non-overlapping entity names appear in the corpus data, the replacement of entity tags does not go wrong, and an accurate template file can be obtained.
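The effect of the replacement order can be sketched as follows. The helper names are hypothetical, and the offsets are character positions in the English example string (half-open, as in Python slicing) rather than the [0|3] and [4|7] values quoted above. A naive left-to-right variant is included only to show how stale offsets corrupt the result.

```python
def replace_right_to_left(corpus, matches):
    """Replace non-overlapping (start, end, tag) spans, rightmost first,
    so earlier offsets stay valid after each replacement."""
    for start, end, tag in sorted(matches, reverse=True):
        corpus = corpus[:start] + "[" + tag + "]" + corpus[end:]
    return corpus


def replace_left_to_right(corpus, matches):
    """Naive variant: offsets of later spans are stale after the first edit."""
    for start, end, tag in sorted(matches):
        corpus = corpus[:start] + "[" + tag + "]" + corpus[end:]
    return corpus


corpus = "Liu Jiayi and Zhang Bingding movies"
matches = [(0, 9, "actor"), (14, 28, "actor")]

template = replace_right_to_left(corpus, matches)
broken = replace_left_to_right(corpus, matches)
print(template)  # [actor] and [actor] movies
print(broken)    # garbled, because the second span no longer lines up
```

An alternative would be to track the cumulative length difference while going left to right; reversing the order, as the text describes, avoids that bookkeeping.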
In another embodiment of the present invention, detecting a preset entity name included in corpus data in step S310 includes: detecting the character content of the preset entity name contained in the corpus data.
In one embodiment of the present invention, since the character content of the preset entity name contained in the corpus data is detected directly, that character content in the corpus data can be replaced with the target entity tag. For example, if the corpus data is "movies of Liu Jiayi and Zhang Bingding", the detected character contents of the preset entity names are "Liu Jiayi" and "Zhang Bingding", and the entity labels corresponding to both names are "actor", then "Liu Jiayi" and "Zhang Bingding" in the corpus data can be directly replaced with "actor" to obtain "movies of [actor] and [actor]". The technical solution of the second embodiment therefore also ensures that the replacement of entity labels does not go wrong when a plurality of non-overlapping entity names appear in the corpus data, so that an accurate template file can be obtained.
In one embodiment of the present invention, for the technical solution of the second embodiment, a first AC automaton may detect and return the character content of the preset entity names contained in the corpus data, and a second AC automaton may determine the target entity tags corresponding to the preset entity names, so that an accurate template file is obtained by directly replacing the character content.
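Content-based replacement needs no offsets at all, as the short sketch below illustrates. The function name is hypothetical, and processing longer names first is an assumption added here so that a shorter name cannot clobber part of a longer one; the text does not prescribe an order.

```python
def replace_by_content(corpus, entity_tags):
    """Replace each detected entity name by its bracketed tag, matching on
    the name's characters rather than on stored positions."""
    # Assumption: longest names first, so a shorter name never eats into
    # a longer name that contains it.
    for name in sorted(entity_tags, key=len, reverse=True):
        corpus = corpus.replace(name, "[" + entity_tags[name] + "]")
    return corpus


tags = {"Liu Jiayi": "actor", "Zhang Bingding": "actor"}
template = replace_by_content("movies of Liu Jiayi and Zhang Bingding", tags)
print(template)  # movies of [actor] and [actor]
```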
In one embodiment of the present invention, as shown in fig. 4, the method for generating a template file according to another embodiment of the present invention further includes, on the basis of step S310 and step S320 shown in fig. 3:
Step S410, if there are a plurality of preset entity names with overlapping characters in the corpus data, judging whether the entity labels corresponding to the plurality of preset entity names are the same; if so, executing step S420; otherwise, executing step S430.
Step S420, replacing the entity name with the largest number of characters among the plurality of preset entity names by the entity label corresponding to the plurality of preset entity names, so as to generate the template file of the corpus data.
In one embodiment of the present invention, for example, in the corpus data "movies of Liu Jiayi", the character-overlapping names "Liu Jiayi" and "Liu Jia" are both preset entity names. If the entity labels corresponding to "Liu Jiayi" and "Liu Jia" are both "actor", then "Liu Jiayi" in the corpus data can be replaced by "actor" to obtain "movies of [actor]".
Step S430, replacing the corresponding entity names in the corpus data by the target entity labels corresponding to the plurality of preset entity names, respectively, so as to generate a plurality of template files of the corpus data.
In one embodiment of the present invention, for example, in the corpus data "movies of Liu Jiayi", the character-overlapping names "Liu Jiayi" and "Liu Jia" are both preset entity names. If the entity label corresponding to "Liu Jiayi" is "actor" and the entity label corresponding to "Liu Jia" is "director", two template files are generated: "movies of [actor]" and "movies of [director]yi", where "yi" is the part of "Liu Jiayi" left over after "Liu Jia" is replaced.
According to the technical scheme of the embodiment shown in fig. 4, when a plurality of preset entity names with overlapping characters exist in the corpus data and the entity labels corresponding to them are the same, the entity name with the largest number of characters is selected for replacement, so that a more accurate template file can be obtained; when the entity labels corresponding to the preset entity names are different, each entity name is replaced separately to generate a plurality of template files, so that a comprehensive set of template files can be obtained.
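Steps S410 to S430 can be illustrated with the following sketch. It is not the patented implementation: `resolve_overlap` is a hypothetical helper, the candidates are assumed to be already known to overlap in the corpus, and labels are rendered in square brackets for readability.

```python
def resolve_overlap(corpus, candidates):
    """Generate template(s) for one group of character-overlapping entity names.

    candidates maps each overlapping entity name to its entity label.
    """
    if len(set(candidates.values())) == 1:
        # Step S420: identical labels, so only the longest name is replaced.
        longest = max(candidates, key=len)
        return [corpus.replace(longest, f"[{candidates[longest]}]")]
    # Step S430: different labels, so each name yields its own template.
    return [corpus.replace(name, f"[{label}]") for name, label in candidates.items()]


corpus = "movies of Liu Jiayi"
print(resolve_overlap(corpus, {"Liu Jiayi": "actor", "Liu Jia": "actor"}))
print(resolve_overlap(corpus, {"Liu Jiayi": "actor", "Liu Jia": "director"}))
```

In the second call the shorter name "Liu Jia" is replaced on its own, which leaves the fragment "yi" behind; such templates are the ones later removed by the filtering of steps S510 and S520.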
Based on the technical solution of the foregoing embodiment, as shown in fig. 5, the method for generating a template file according to still another embodiment of the present invention further includes the following steps:
Step S510, detecting whether the generated plurality of template files contain any one of the preset entity names, or contain the non-overlapping portion of any two entity names among the preset entity names.
In one embodiment of the present invention, for example, for the corpus data "movies of Liu Jiayi and Zhang Bingding", the entity label corresponding to the entity name "Liu Jiayi" is "actor"; the entity label corresponding to the entity name "Liu Jia" is "director"; the entity label corresponding to the entity name "Zhang Bingding" is "actor"; and the entity label corresponding to the entity name "Zhang Bing" is "director". The following template files may be obtained through the technical solution of the above embodiment of the present invention: "movies of [actor] and Zhang Bingding"; "movies of [actor] and [actor]"; "movies of [actor] and [director]ding"; "movies of Liu Jiayi and [actor]"; "movies of Liu Jiayi and [director]ding"; "movies of [director]yi and Zhang Bingding"; "movies of [director]yi and [actor]"; "movies of [director]yi and [director]ding". Here "yi" and "ding" are the parts of "Liu Jiayi" and "Zhang Bingding" left over when only "Liu Jia" or "Zhang Bing" is replaced. Some of these template files contain the entity name "Liu Jiayi" or "Zhang Bingding", the non-overlapping part of "Liu Jiayi" and "Liu Jia", or the non-overlapping part of "Zhang Bingding" and "Zhang Bing".
Step S520, if any template file contains any one of the preset entity names, or contains the non-overlapping portion of any two of the preset entity names, deleting that template file from the plurality of template files, so as to filter the plurality of template files.
In one embodiment of the present invention, as in the above example, since "Liu Jiayi" and "Zhang Bingding" are entity names, the template files that still contain these two entity names are clearly inaccurate. The template files that contain the leftover fragments "yi" and "ding" (the parts of "Liu Jiayi" and "Zhang Bingding" not covered by "Liu Jia" and "Zhang Bing") were clearly produced by directly replacing "Liu Jia" and "Zhang Bing" without considering the more likely names "Liu Jiayi" and "Zhang Bingding". These template files therefore need to be deleted, so that more accurate template files are obtained.
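The filtering of steps S510 and S520 might be sketched as below. The helper names `non_overlapping_part` and `filter_templates` are hypothetical, and the plain substring test is an assumption that suits this example; with romanized names a short fragment could also match unrelated words, so a real system would likely need a stricter check.

```python
def non_overlapping_part(longer, shorter):
    """Return what remains of `longer` once `shorter` is removed, else None."""
    if shorter != longer and shorter in longer:
        return longer.replace(shorter, "", 1)
    return None


def filter_templates(templates, entity_names):
    """Drop templates that still contain an entity name or a leftover fragment."""
    banned = set(entity_names)
    for a in entity_names:
        for b in entity_names:
            rest = non_overlapping_part(a, b)
            if rest:
                banned.add(rest)
    return [t for t in templates if not any(fragment in t for fragment in banned)]


templates = [
    "movies of [actor] and Zhang Bingding",
    "movies of [actor] and [actor]",
    "movies of [actor] and [director]ding",
    "movies of Liu Jiayi and [actor]",
    "movies of Liu Jiayi and [director]ding",
    "movies of [director]yi and Zhang Bingding",
    "movies of [director]yi and [actor]",
    "movies of [director]yi and [director]ding",
]
names = ["Liu Jiayi", "Liu Jia", "Zhang Bingding", "Zhang Bing"]
print(filter_templates(templates, names))
```

For the eight candidate templates of the example, only "movies of [actor] and [actor]" survives the filter.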
Implementation details of the technical solution of the embodiment of the present invention are described in detail below with reference to fig. 6 and 7.
As shown in fig. 6, in a method for generating a template file, the method includes the following steps:
Step S601, performing character traversal on the corpus data.
Step S602, judging whether the current characters can be matched with an entity name in the database; if so, executing step S603; otherwise, executing step S604.
Step S603, replacing the entity name with the entity tag.
Step S604, judging whether the traversal has reached the end; if so, determining that a template file is obtained; otherwise, returning to step S601 to continue the traversal.
The technical scheme shown in fig. 6 performs brute-force string matching on the corpus data character by character and replaces an entity name with its entity label as soon as it is matched. This scheme has the following problems:
1. For example, for the corpus data "who is the wife of Liu Jiayi", if the entity names "Liu Jia" and "Liu Jiayi" both exist and the entity labels corresponding to both are "singer", the template obtained according to the scheme shown in fig. 6 is "who is the wife of [singer]yi", in which the fragment "yi" of the longer name is left behind.
2. Assuming that the entity tag corresponding to the entity name "Liu Jia" is "actor" and the entity tag corresponding to the entity name "Liu Jiayi" is "singer", the template obtained according to the scheme shown in fig. 6 is "who is the wife of [actor]yi", and the useful template "who is the wife of [singer]" is not generated at all. That is, when entity names overlap, only one template can be obtained, and the full permutation and combination of all possible templates cannot be obtained.
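The two problems above can be reproduced with a small sketch of a first-match-wins traversal. This is one possible reading of the scheme of fig. 6, not a confirmed implementation; the function name `brute_force_template` and the dictionary lookup standing in for the database are assumptions.

```python
def brute_force_template(corpus, name_to_label):
    """Traverse the corpus and replace an entity name as soon as it matches."""
    out = []
    i = 0
    while i < len(corpus):
        # Grow the candidate substring one character at a time (S601/S602).
        for j in range(i + 1, len(corpus) + 1):
            label = name_to_label.get(corpus[i:j])
            if label is not None:
                out.append(f"[{label}]")  # S603: replace at the first hit
                i = j
                break
        else:
            # No entity name starts at position i; keep the character.
            out.append(corpus[i])
            i += 1
    return "".join(out)


corpus = "who is the wife of Liu Jiayi"
print(brute_force_template(corpus, {"Liu Jia": "singer", "Liu Jiayi": "singer"}))
print(brute_force_template(corpus, {"Liu Jia": "actor", "Liu Jiayi": "singer"}))
```

In both calls the shorter name "Liu Jia" is hit first, so the longer name "Liu Jiayi" is never considered and the fragment "yi" remains in the single template that is produced.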
In view of the above problems, the embodiment of the present invention provides a solution, specifically as shown in fig. 7, including step S701 and step S702, which are described in detail below:
Step S701, performing multi-pattern matching with an AC automaton.
In one embodiment of the present invention, step S701 may include the following when specifically implemented: establishment of the AC automaton, solutions to the various problems arising from the multi-pattern matching algorithm, and full permutation of multiple entity address conflicts.
1. Establishment of AC automaton
In one embodiment of the present invention, in order to improve data computing efficiency, Spark (a computing engine) may be used to process massive data, and an AC automaton may be built in Spark to load the relevant entity names and entity tags, for example a preset large set of actor and director names as entity names, tagged "actor" and "director" respectively.
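For readers unfamiliar with AC (Aho-Corasick) automata, the following is a compact pure-Python sketch of multi-pattern matching that reports a start position, an end position, and an entity tag for every match. The class name `ACAutomaton`, the 0-based half-open offsets, and the single-process setting are assumptions made for illustration; distribution over Spark is not shown.

```python
from collections import deque


class ACAutomaton:
    def __init__(self, name_to_label):
        # Per node: child map, failure link, and (name_length, label) outputs.
        self.children, self.fail, self.out = [{}], [0], [[]]
        for name, label in name_to_label.items():
            node = 0
            for ch in name:
                if ch not in self.children[node]:
                    self.children.append({})
                    self.fail.append(0)
                    self.out.append([])
                    self.children[node][ch] = len(self.children) - 1
                node = self.children[node][ch]
            self.out[node].append((len(name), label))
        self._build_failure_links()

    def _build_failure_links(self):
        queue = deque(self.children[0].values())  # depth-1 nodes fail to root
        while queue:
            node = queue.popleft()
            for ch, child in self.children[node].items():
                f = self.fail[node]
                while f and ch not in self.children[f]:
                    f = self.fail[f]
                self.fail[child] = self.children[f].get(ch, 0)
                # Inherit matches that end at the failure target.
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def search(self, text):
        """Return (start, end, label) triples; end is exclusive."""
        node, hits = 0, []
        for i, ch in enumerate(text):
            while node and ch not in self.children[node]:
                node = self.fail[node]
            node = self.children[node].get(ch, 0)
            for length, label in self.out[node]:
                hits.append((i + 1 - length, i + 1, label))
        return hits


ac = ACAutomaton({"Liu Jiayi": "actor", "Liu Jia": "director"})
text = "movies of Liu Jiayi"
for start, end, label in ac.search(text):
    print(start, end, label, text[start:end])
```

A single pass over the text reports both "Liu Jia" and "Liu Jiayi", which is what makes the overlapping-name conflicts visible to the later steps.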
The return of the AC automaton is typically of the form "start|end|label", representing the start position, end position, and entity tag of the matched entity name, respectively. If a plurality of entity names are matched in one piece of corpus data and the first matched entity name is replaced, the positions of the subsequent entity names change, and the absolute position values returned by the AC automaton lose their meaning. For example, for the corpus data "movies of Liu Jiayi and Zhang Bingding", suppose the AC automaton returns [0|3|actor], [4|7|actor] (the positions are character offsets in the original-language sentence, which begins with the first entity name). If positions 0-3 (in left-to-right order, excluding position 0) in the corpus data are replaced with "actor", "movies of [actor] and Zhang Bingding" is obtained; the entity name "Zhang Bingding" is then no longer at the original positions 4-7 (excluding position 4), and continuing the replacement would produce a wrong template file. To solve this problem, in one embodiment of the present invention, two AC automata may be built, one returning the matched entity name and the other returning the entity tag, and the tag is finally substituted according to the entity name, which resolves the problem of replacing entity names with entity tags.
In another embodiment of the present invention, the entity names "Liu Jiayi" and "Zhang Bingding" may be replaced in right-to-left order: positions 4-7 in the corpus data are first replaced with "actor" to obtain "movies of Liu Jiayi and [actor]", and positions 0-3 in the corpus data are then replaced with "actor" to obtain "movies of [actor] and [actor]". Because each replacement only changes text to the right of the positions still to be processed, this also solves the problem of replacing entity names with entity labels.
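The right-to-left replacement can be sketched as follows. The function name `replace_right_to_left` and the use of 0-based half-open offsets computed on the romanized sentence are assumptions for illustration, so the numeric positions differ from those quoted in the text.

```python
def replace_right_to_left(corpus, matches):
    """Replace non-overlapping (start, end, label) matches, end exclusive.

    Processing the rightmost match first keeps the offsets of the
    remaining matches valid.
    """
    template = corpus
    for start, end, label in sorted(matches, key=lambda m: m[0], reverse=True):
        template = template[:start] + f"[{label}]" + template[end:]
    return template


corpus = "movies of Liu Jiayi and Zhang Bingding"
first = corpus.index("Liu Jiayi")
second = corpus.index("Zhang Bingding")
matches = [
    (first, first + len("Liu Jiayi"), "actor"),
    (second, second + len("Zhang Bingding"), "actor"),
]
print(replace_right_to_left(corpus, matches))
```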
2. Solution to various problems arising through multimode matching algorithms
In an embodiment of the present invention, the problems arising from the multi-pattern matching algorithm include the address conflict problem for entities of the same kind and the address conflict problem for entities of different kinds, which are described in turn below:
2.1 Address conflict problem for entities of the same kind
In one embodiment of the present invention, taking the corpus data "movies of Liu Jiayi and Zhang Bingding" as an example, and assuming that the entity labels corresponding to the entity names "Liu Jiayi" and "Liu Jia" are both "actor", the return of the AC automaton is: [0|2|actor], [0|3|actor]. Analysis of a large number of processed sentences shows that, for address conflicts between entities of the same kind, choosing the longer address is usually the reasonable choice; in this example, replacing "Liu Jiayi" with the entity tag is the reasonable choice. The following conclusion can thus be drawn: when the addresses of entities of the same kind conflict, the longer entity is selected for replacement.
2.2 Address conflict problem for entities of different kinds
In one embodiment of the present invention, taking the corpus data "movies of Liu Jiayi and Zhang Bingding" as an example, if the entity label corresponding to the entity name "Liu Jiayi" is "actor" and the entity label corresponding to the entity name "Liu Jia" is "director", the return of the AC automaton is: [0|2|director], [0|3|actor]. Although the addresses of the entity names conflict, the corresponding entity labels differ, so the two candidates can be considered equally probable and two templates are generated: "movies of [actor] and Zhang Bingding" and "movies of [director]yi and Zhang Bingding", where "yi" is the part of "Liu Jiayi" left over after "Liu Jia" is replaced.
3. Full permutation of multiple entity address conflicts
In one embodiment of the present invention, taking the corpus data "movies of Liu Jiayi and Zhang Bingding" as an example, suppose the entity label corresponding to the entity name "Liu Jiayi" is "actor"; the entity label corresponding to the entity name "Liu Jia" is "director"; the entity label corresponding to the entity name "Zhang Bingding" is "actor"; and the entity label corresponding to the entity name "Zhang Bing" is "director". The following template files may be obtained through the technical solution of the above embodiment of the present invention: "movies of [actor] and Zhang Bingding"; "movies of [actor] and [actor]"; "movies of [actor] and [director]ding"; "movies of Liu Jiayi and [actor]"; "movies of Liu Jiayi and [director]ding"; "movies of [director]yi and Zhang Bingding"; "movies of [director]yi and [actor]"; "movies of [director]yi and [director]ding". Here "yi" and "ding" are the parts of "Liu Jiayi" and "Zhang Bingding" left over when only "Liu Jia" or "Zhang Bing" is replaced.
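One way to enumerate the full permutation is to treat each conflict position as a choice among "leave untouched" and each candidate name, and take the Cartesian product of those choices. The sketch below rests on assumptions: `all_templates` is a hypothetical helper, every candidate in a group is assumed to start at the same offset (as in the example), and the unmodified corpus is assumed not to count as a template.

```python
from itertools import product


def all_templates(corpus, groups):
    """Enumerate templates for conflict groups of the form (start, candidates).

    candidates is a list of (entity_name, label) pairs that all begin at
    offset `start` in the corpus.
    """
    # Each group is either left untouched (None) or replaced by one candidate.
    choices = [[None] + candidates for _, candidates in groups]
    templates = []
    for combo in product(*choices):
        if all(choice is None for choice in combo):
            continue  # skip the unmodified corpus
        text = corpus
        # Apply replacements right to left so earlier offsets stay valid.
        pairs = sorted(zip(groups, combo), key=lambda p: p[0][0], reverse=True)
        for (start, _), choice in pairs:
            if choice is not None:
                name, label = choice
                text = text[:start] + f"[{label}]" + text[start + len(name):]
        templates.append(text)
    return templates


corpus = "movies of Liu Jiayi and Zhang Bingding"
groups = [
    (corpus.index("Liu Jiayi"), [("Liu Jiayi", "actor"), ("Liu Jia", "director")]),
    (corpus.index("Zhang Bingding"), [("Zhang Bingding", "actor"), ("Zhang Bing", "director")]),
]
for template in all_templates(corpus, groups):
    print(template)
```

With two groups of two candidates each, this yields 3 x 3 - 1 = 8 templates, matching the eight listed in the example.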
Step S702, performing template selection.
In one embodiment of the present invention, if a candidate template contains a relevant entity name of the field, or the difference between overlapping entity names of the field (for example, the difference between "Liu Jiayi" and "Liu Jia" is "yi", and the difference between "Zhang Bingding" and "Zhang Bing" is "ding"), the template is filtered out. For example, the above templates must not contain "Liu Jiayi", "Zhang Bingding", "Liu Jia", "Zhang Bing", "yi", or "ding", so the template file obtained after filtering is: "movies of [actor] and [actor]".
The technical scheme of the embodiment of the invention solves the entity position conflict problem for entities with the same entity tag, solves the entity position conflict problem for entities with different entity tags, and generates the full permutation of all possible templates. Meanwhile, the technical scheme of the embodiment of the invention makes it convenient to mine template files in a newly built field, can improve support for not-yet-covered semantics and template mining in existing fields, and can recall more corpora in a given field to enrich the domain corpus.
The following describes an embodiment of the apparatus of the present invention, which may be used to execute the template file generating method in the above embodiment of the present invention. For details not disclosed in the embodiment of the apparatus of the present invention, please refer to the embodiment of the method for generating a template file according to the present invention.
Fig. 8 schematically shows a block diagram of a template file generating apparatus according to an embodiment of the present invention.
Referring to fig. 8, a template file generating apparatus 800 according to an embodiment of the present invention includes: a first detection unit 801, a determination unit 802, and a generation unit 803.
The first detection unit 801 is configured to detect a preset entity name included in corpus data; the determining unit 802 is configured to determine, according to a correspondence between entity names and entity tags, a target entity tag corresponding to the preset entity name; the generating unit 803 is configured to replace the preset entity name included in the corpus data by the target entity tag, so as to generate a template file of the corpus data; the generating unit 803 is further configured to, when there are multiple preset entity names with overlapping characters in the corpus data, replace corresponding entity names in the corpus data with target entity labels corresponding to the multiple preset entity names, so as to generate multiple template files of the corpus data.
Referring to fig. 9, the template file generating apparatus 900 according to another embodiment of the present invention further includes, on the basis of having the first detecting unit 801, the determining unit 802, and the generating unit 803 shown in fig. 8: a second detection unit 901 and a deletion unit 902.
The second detecting unit 901 is configured to detect, after the generating unit 803 generates a plurality of template files of the corpus data, whether the plurality of template files contain any one of the preset entity names, or contain a non-overlapping portion of any two entity names among the preset entity names.
The deleting unit 902 is configured to, when the second detecting unit 901 detects that any template file contains any one of the preset entity names or a non-overlapping portion of any two entity names among the preset entity names, delete that template file from the plurality of template files, so as to filter the plurality of template files.
In one embodiment of the present invention, the template file generating apparatus shown in fig. 8 and 9 may further include a judging unit. The judging unit is configured to judge whether the entity labels corresponding to the plurality of preset entity names are the same; the generating unit 803 is configured to, when the entity labels corresponding to the plurality of preset entity names are different, replace the corresponding entity names in the corpus data by the target entity labels corresponding to the plurality of preset entity names, respectively.
In one embodiment of the present invention, based on the foregoing scheme, the generating unit 803 is further configured to: when the entity labels corresponding to the plurality of preset entity names are the same, replace the entity name with the largest number of characters among the plurality of preset entity names by the entity label corresponding to the plurality of preset entity names, so as to generate a template file of the corpus data.
In one embodiment of the present invention, based on the foregoing scheme, the first detection unit 801 is configured to: detect the position information of the preset entity name in the corpus data.
In one embodiment of the present invention, based on the foregoing scheme, the generating unit 803 is configured to: replace the characters corresponding to the position information in the corpus data by the target entity tag.
In one embodiment of the present invention, based on the foregoing solution, the position information is the position information, in the corpus data, of the characters contained in the preset entity name, arranged in a first order; the generating unit 803 is configured to: when the corpus data contains a plurality of non-overlapping preset entity names, replace the plurality of non-overlapping preset entity names one by one, following their order in the corpus data according to a second order, wherein the first order is opposite to the second order.
In one embodiment of the present invention, based on the foregoing scheme, the AC automaton detects and returns the location information of the preset entity name included in the corpus data, and the target entity tag corresponding to the preset entity name.
In one embodiment of the present invention, based on the foregoing scheme, the first detection unit 801 is configured to: detect the character content of the preset entity name contained in the corpus data.
In one embodiment of the present invention, based on the foregoing scheme, the generating unit 803 is configured to: replace the character content in the corpus data by the target entity tag.
In one embodiment of the present invention, based on the foregoing scheme, character content of a preset entity name contained in the corpus data is detected and returned by a first AC automaton, and a target entity tag corresponding to the preset entity name is determined by a second AC automaton.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the invention. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present invention may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, and includes several instructions to cause a computing device (may be a personal computer, a server, a touch terminal, or a network device, etc.) to perform the method according to the embodiments of the present invention.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It is to be understood that the invention is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (12)

1. The template file generating method is characterized by comprising the following steps:
detecting position information of a preset entity name contained in corpus data in the corpus data, wherein the position information is the position information of characters contained in the preset entity name in the corpus data which are arranged according to a first sequence, and the position information comprises a starting position and an ending position corresponding to the characters contained in the preset entity name in the corpus data;
determining a target entity label corresponding to the preset entity name according to the corresponding relation between the entity name and the entity label;
replacing the preset entity names contained in the corpus data by the target entity labels to generate template files of the corpus data; if a plurality of preset entity names with overlapped characters exist in the same corpus data, replacing the corresponding entity names in the corpus data by target entity labels corresponding to the preset entity names respectively to generate a plurality of template files of the corpus data;
detecting whether the template files contain non-overlapping parts of any two entity names in the preset entity names or not;
if any template file is detected to contain the non-overlapping part of any two entity names in the preset entity names, deleting any template file from the plurality of template files so as to filter the plurality of template files;
the step of replacing the preset entity name contained in the corpus data by the target entity tag to generate a template file of the corpus data comprises the following steps: if the corpus data contains a plurality of non-overlapping preset entity names, replacing characters at corresponding positions in the corpus data according to position information of the non-overlapping preset entity names in the corpus data and the sequence of the non-overlapping preset entity names in the corpus data according to a second sequence, and sequentially using target entity labels respectively corresponding to the non-overlapping preset entity names, wherein the first sequence is opposite to the second sequence.
2. The method of generating a template file according to claim 1, further comprising, after generating a plurality of template files of the corpus data:
detecting whether any one of the preset entity names is contained in the template files;
if any template file is detected to contain any one of the preset entity names, deleting any template file from the plurality of template files so as to filter the plurality of template files.
3. The method for generating a template file according to claim 1, further comprising, before replacing the corresponding entity names in the corpus data by the target entity tags corresponding to the plurality of preset entity names, respectively:
judging whether the entity labels corresponding to the preset entity names are the same or not;
and if the entity labels corresponding to the preset entity names are different, triggering the step of replacing the corresponding entity names in the corpus data through the target entity labels corresponding to the preset entity names.
4. A method of generating a template file according to claim 3, further comprising:
if the entity labels corresponding to the preset entity names are the same, replacing the entity name with the largest character number in the preset entity names by the entity labels corresponding to the preset entity names so as to generate a template file of the corpus data.
5. The method for generating a template file according to claim 1, wherein replacing the preset entity name included in the corpus data by the target entity tag comprises:
and replacing characters corresponding to the position information in the corpus data by the target entity tag.
6. The method according to claim 1, wherein the AC automaton detects and returns the position information of the preset entity name contained in the corpus data, and the target entity tag corresponding to the preset entity name.
7. The method for generating a template file according to claim 1, wherein detecting location information of a preset entity name included in corpus data in the corpus data includes:
and detecting the position information of the character content of the preset entity name in the corpus data.
8. The method for generating a template file according to claim 7, wherein replacing the preset entity name included in the corpus data by the target entity tag comprises:
and replacing the character content in the corpus data by the target entity tag.
9. The method according to claim 7, wherein character content of a preset entity name contained in the corpus data is detected and returned by a first AC automaton, and a target entity tag corresponding to the preset entity name is determined by a second AC automaton.
10. A template file generating apparatus, comprising:
the first detection unit is used for detecting position information of a preset entity name contained in corpus data in the corpus data, wherein the position information is the position information of characters contained in the preset entity name in the corpus data which are arranged according to a first sequence, and the position information comprises a starting position and an ending position corresponding to the characters contained in the preset entity name in the corpus data;
the determining unit is used for determining a target entity label corresponding to the preset entity name according to the corresponding relation between the entity name and the entity label;
the generating unit is used for replacing the preset entity names contained in the corpus data through the target entity labels so as to generate a template file of the corpus data;
a second detecting unit, configured to detect whether a plurality of template files of the corpus data contain non-overlapping portions of any two entity names in the preset entity names after the generating unit generates the plurality of template files;
a deleting unit, configured to delete, when the second detecting unit detects that any template file contains a non-overlapping portion of any two entity names in the preset entity names, any template file from the plurality of template files, so as to filter the plurality of template files;
the generating unit is further configured to, when a plurality of preset entity names with overlapping characters exist in the same corpus data, replace corresponding entity names in the corpus data by target entity labels corresponding to the plurality of preset entity names, so as to generate a plurality of template files of the corpus data;
wherein the generating unit is configured to: if the corpus data contains a plurality of non-overlapping preset entity names, replacing characters at corresponding positions in the corpus data according to position information of the non-overlapping preset entity names in the corpus data and the sequence of the non-overlapping preset entity names in the corpus data according to a second sequence, and sequentially using target entity labels respectively corresponding to the non-overlapping preset entity names, wherein the first sequence is opposite to the second sequence.
11. A computer readable medium on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method of generating a template file according to any one of claims 1 to 9.
12. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method of generating a template file as claimed in any one of claims 1 to 9.
CN201810367499.8A 2018-04-23 2018-04-23 Template file generation method and device, computer readable medium and electronic equipment Active CN110309315B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810367499.8A CN110309315B (en) 2018-04-23 2018-04-23 Template file generation method and device, computer readable medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN110309315A CN110309315A (en) 2019-10-08
CN110309315B true CN110309315B (en) 2024-02-02

Family

ID=68073888

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810367499.8A Active CN110309315B (en) 2018-04-23 2018-04-23 Template file generation method and device, computer readable medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN110309315B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111046667B (en) * 2019-11-14 2024-02-06 深圳市优必选科技股份有限公司 Statement identification method, statement identification device and intelligent equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104317839A (en) * 2014-10-10 2015-01-28 北京国双科技有限公司 Method and device for generating report form template
CN106910501A (en) * 2017-02-27 2017-06-30 腾讯科技(深圳)有限公司 Text entities extracting method and device
CN107577655A (en) * 2016-07-05 2018-01-12 北京国双科技有限公司 Name acquiring method and apparatus
CN107608960A (en) * 2017-09-08 2018-01-19 北京奇艺世纪科技有限公司 A kind of method and apparatus for naming entity link

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
一种基于云平台的防汛文档智能生成模型构建;姜鹏等;《水利信息化》;20130625(第03期);全文 *

Also Published As

Publication number Publication date
CN110309315A (en) 2019-10-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant