CN111027281B

CN111027281B - Word segmentation method, device, equipment and storage medium

Info

Publication number: CN111027281B
Application number: CN201911140087.1A
Authority: CN
Inventors: 任为
Original assignee: Beijing ByteDance Network Technology Co Ltd
Current assignee: Beijing ByteDance Network Technology Co Ltd
Priority date: 2019-11-20
Filing date: 2019-11-20
Publication date: 2023-06-06
Anticipated expiration: 2039-11-20
Also published as: CN111027281A

Abstract

Embodiments of the disclosure disclose a word segmentation method, device, equipment and storage medium. The method comprises: creating at least one word segmentation entity in a word segmentation area, the word segmentation entity carrying an identification code; acquiring the identification code of a target word segmentation entity corresponding to the word segmentation area according to a triggering operation of a user; determining a marking style according to the identification code of the target word segmentation entity; and marking the word segmentation area and the target word segmentation entity according to the marking style. Because each word segmentation entity created in the word segmentation area carries an identification code, and the word segmentation area and the target word segmentation entity are marked with the determined marking style when the user triggers the area, multiple word segmentation entities can be created in the same area of the rich text, which improves the reliability of word segmentation.

Description

Word segmentation method, device, equipment and storage medium

Technical Field

The embodiment of the disclosure relates to the technical field of rich text word segmentation, in particular to a word segmentation method, device, equipment and storage medium.

Background

Traditional word segmentation is implemented by creating an Entity for a word segmentation area in rich text. The drawback of this approach appears when several word segmentation entities are created for the same area: if an Entity already exists in the word segmentation area, the newly created word segmentation Entity overwrites the part that overlaps with the existing Entity. There is therefore a need for a method that can create multiple word segmentation entities in the same area.
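
The overwrite problem can be illustrated with a deliberately simplified model. The sketch below assumes that each character holds at most one entity key; the function `applyEntity` and the keys "A" and "B" are hypothetical names chosen for illustration and are not taken from any particular editor.

```javascript
// Simplified model (an assumption for illustration only):
// every character stores at most ONE entity key.
function applyEntity(charEntities, start, end, entityKey) {
  const next = charEntities.slice();
  for (let i = start; i < end; i++) {
    next[i] = entityKey; // replaces whatever entity key was stored before
  }
  return next;
}

const text = "rich text";
let entities = new Array(text.length).fill(null);
entities = applyEntity(entities, 0, 6, "A"); // entity A covers [0, 6)
entities = applyEntity(entities, 3, 9, "B"); // entity B covers [3, 9)

// Characters 3..5 used to belong to A, but now reference only B.
console.log(entities.join("")); // AAABBBBBB
```

In this model the overlapping part of entity A is lost as soon as entity B is applied, which is the behavior the disclosure sets out to avoid.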

Disclosure of Invention

The embodiment of the disclosure provides a word segmentation method, device, equipment and storage medium, so as to realize the creation of a plurality of word segmentation entities in the same region of rich text and improve the reliability of word segmentation.

In a first aspect, an embodiment of the present disclosure provides a word segmentation method, including:

creating at least one word segmentation entity in the word segmentation area by adopting a set rule; the word segmentation entity carries an identification code;

acquiring an identification code of a target word segmentation entity corresponding to the word segmentation area according to triggering operation of a user;

determining a marking pattern according to the identification code of the target word segmentation entity;

and marking the word segmentation area and the target word segmentation entity according to the marking style.

In a second aspect, an embodiment of the present disclosure further provides a word segmentation apparatus, including:

the word segmentation entity creation module is used for creating at least one word segmentation entity in the word segmentation area by adopting a set rule; the word segmentation entity carries an identification code;

the word segmentation entity identification code acquisition module is used for acquiring the identification code of the target word segmentation entity corresponding to the word segmentation area according to the triggering operation of the user;

the marking pattern determining module is used for determining a marking pattern according to the identification code of the target word segmentation entity;

and the marking module is used for marking the word segmentation area and the target word segmentation entity according to the marking style.

In a third aspect, embodiments of the present disclosure further provide an electronic device, including:

one or more processing devices;

a storage means for storing one or more programs;

the one or more programs, when executed by the one or more processing devices, cause the one or more processing devices to implement the method of word segmentation as described in embodiments of the present disclosure.

In a fourth aspect, the embodiments of the present disclosure further provide a computer readable medium having a computer program stored thereon, which when executed by a processing device, implements the method of word segmentation as described in the embodiments of the disclosure.

According to the embodiment of the disclosure, at least one word segmentation entity is firstly created in a word segmentation area, the word segmentation entity carries an identification code, then the identification code of a target word segmentation entity corresponding to the word segmentation area is obtained according to triggering operation of a user, then a marking pattern is determined according to the identification code of the target word segmentation entity, and finally the word segmentation area and the target word segmentation entity are marked according to the marking pattern. According to the word segmentation method provided by the embodiment of the disclosure, at least one word segmentation entity carrying the identification code is created in the word segmentation area, and when a user triggers the word segmentation area, the word segmentation area and the target word segmentation entity are marked according to the determined marking style, so that a plurality of word segmentation entities are created in the same area of the rich text, and the reliability of word segmentation is improved.

Drawings

FIG. 1 is a flowchart of a word segmentation method in a first embodiment of the present disclosure;

FIG. 2 is a schematic structural diagram of a word segmentation device in a second embodiment of the present disclosure;

FIG. 3 is a schematic structural diagram of an electronic device in a third embodiment of the present disclosure.

Detailed Description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.

It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.

The term "including" and variations thereof as used herein are intended to be open-ended, i.e., "including, but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Related definitions of other terms will be given in the description below.

It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.

It should be noted that references to "one" and "a plurality" in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.

The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.

Example 1

FIG. 1 is a flowchart of a word segmentation method provided in the first embodiment of the present disclosure. This embodiment is applicable to word segmentation of rich text. The method may be performed by a word segmentation apparatus, which may be composed of hardware and/or software and may generally be integrated in a device having a word segmentation function; the device may be an electronic device such as a server, a mobile terminal, or a server cluster. As shown in FIG. 1, the method specifically comprises the following steps:

Step 110, creating at least one word segmentation entity in the word segmentation area.

The word segmentation entity (Entity) carries an identification code. The function of a word segmentation entity is to annotate a piece of text and carry some additional information. The word segmentation area may be an area of the rich text that contains a piece of text, and the area includes at least one character. Specifically, at least one word segmentation entity is created in the word segmentation area by using a set rule, where the set rule may be the inline style mechanism (inlineStyleRanges). In the embodiment of the disclosure, inline styles can be rendered in an overlapping manner, so multiple word segmentation entities can be created in the same area; that is, a newly created word segmentation entity does not overwrite an existing one. The identification code is carried in the style field of the word segmentation entity and may be a custom field.
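
As a minimal sketch of this idea, the code below stores each word segmentation entity as one more inline style range on a block, with the identification code carried in the `style` field. The `{ offset, length, style }` shape follows the raw content format used by Draft.js; the prefix `SEG_ENTITY_` and the helper names `addSegmentationEntity` and `entityIdsAt` are assumptions introduced for illustration and do not appear in the disclosure.

```javascript
// Hypothetical prefix that marks a style string as an entity identification code.
const ENTITY_STYLE_PREFIX = "SEG_ENTITY_";

// Add a word segmentation entity as one more inline style range.
// Existing ranges are kept, so overlapping entities coexist.
function addSegmentationEntity(block, offset, length, id) {
  if (offset < 0 || length <= 0 || offset + length > block.text.length) {
    throw new RangeError("range is outside the block text");
  }
  const range = { offset, length, style: ENTITY_STYLE_PREFIX + id };
  return { ...block, inlineStyleRanges: [...block.inlineStyleRanges, range] };
}

// Return the identification codes of all entities covering a character index.
function entityIdsAt(block, index) {
  return block.inlineStyleRanges
    .filter((r) => r.style.startsWith(ENTITY_STYLE_PREFIX))
    .filter((r) => index >= r.offset && index < r.offset + r.length)
    .map((r) => r.style.slice(ENTITY_STYLE_PREFIX.length));
}

let block = { text: "rich text editor", inlineStyleRanges: [] };
block = addSegmentationEntity(block, 0, 9, "e1");  // covers "rich text"
block = addSegmentationEntity(block, 5, 11, "e2"); // covers "text editor"

console.log(entityIdsAt(block, 6)); // [ 'e1', 'e2' ]
```

Because a new range is appended rather than written over existing ones, index 6 is covered by both entities at the same time.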

Specifically, the manner of creating at least one word segmentation entity in the word segmentation area may be: creating at least one initial word segmentation entity in the word segmentation area; and adding annotation information in the initial word segmentation entity to obtain the word segmentation entity.

The annotation information is an annotation of the text in the word segmentation area. The initial word segmentation entity includes additional information such as the identification code of the entity. The annotation information may be a textual explanation of the text in the word segmentation area or a web page link associated with that text. In the embodiment of the disclosure, a plurality of word segmentation entities may be created in the same word segmentation area. For example, two initial word segmentation entities are created first; annotation information in text form is added to one of them, and annotation information in link form is added to the other. The user can jump to the corresponding web page by clicking the link of the word segmentation entity, thereby obtaining the content corresponding to the link.
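
The two-stage creation described above (an initial entity first, annotation information second) might look like the following sketch. The object layout, the annotation types "text" and "link", the helper names, and the example URL are all illustrative assumptions.

```javascript
// Create an initial entity that carries only its identification code.
function createInitialEntity(id) {
  return { id, annotation: null };
}

// Attach annotation information: either explanatory text or a web page link.
function annotate(entity, annotation) {
  if (annotation.type !== "text" && annotation.type !== "link") {
    throw new TypeError("annotation.type must be 'text' or 'link'");
  }
  return { ...entity, annotation };
}

// Two entities for the same word segmentation area, one per annotation form.
const textEntity = annotate(createInitialEntity("e1"), {
  type: "text",
  value: "A note explaining this passage.",
});
const linkEntity = annotate(createInitialEntity("e2"), {
  type: "link",
  value: "https://example.com/reference", // placeholder URL
});

console.log(textEntity.annotation.type, linkEntity.annotation.type); // text link
```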

Step 120, acquiring the identification code of the target word segmentation entity corresponding to the word segmentation area according to the triggering operation of the user.

The triggering operation may be the operation detected by the terminal when the user clicks the word segmentation area with a mouse or a finger. In the embodiment of the disclosure, a word segmentation entity created with an inline style cannot respond to the user's click event, so the DraftEditorLeaf.react.js file of the rich text editor needs to be modified so that a word segmentation area in which a word segmentation entity has been created can respond to click events. In the application scenario, after the DraftEditorLeaf.react.js file has been modified, a user who wants to view the annotation information of a word segmentation area clicks that area; when the terminal detects the triggering operation generated by the click, it acquires the identification code of the target word segmentation entity corresponding to the word segmentation area according to the triggering operation.

Specifically, the identification code of the target word segmentation entity corresponding to the word segmentation area may be acquired according to the triggering operation of the user as follows: acquiring at least one word segmentation entity corresponding to the word segmentation area according to the triggering operation of the user; and determining the word segmentation entity ranked first among the at least one word segmentation entity as the target word segmentation entity, and acquiring the identification code of the target word segmentation entity.

Here, "ranked first" may be understood as ranked first by creation time. In the embodiment of the disclosure, if the clicked word segmentation area includes a plurality of word segmentation entities, these entities are obtained first, then the word segmentation entity whose creation time ranks first is determined as the target word segmentation entity, and the identification code of the target word segmentation entity is acquired.
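
A possible way to pick the target entity is sketched below. It reads "ranked first by creation time" as "earliest creation time", which is an interpretation rather than something the text states outright; the field name `createdAt` and the function `findTargetEntity` are hypothetical.

```javascript
// Return the entity covering the clicked index that was created first,
// or null when no entity covers that index.
function findTargetEntity(entities, clickedIndex) {
  const covering = entities.filter(
    (e) => clickedIndex >= e.offset && clickedIndex < e.offset + e.length
  );
  if (covering.length === 0) return null;
  // Assumption: "ranked first" means the smallest creation timestamp.
  return covering.reduce((first, e) => (e.createdAt < first.createdAt ? e : first));
}

const entities = [
  { id: "e2", offset: 5, length: 11, createdAt: 200 },
  { id: "e1", offset: 0, length: 9, createdAt: 100 },
];

const target = findTargetEntity(entities, 6);
console.log(target.id); // e1
```

If the intended ranking were the reverse (most recent first), only the comparison inside `reduce` would need to change.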

Step 130, determining a marking pattern according to the identification code of the target word segmentation entity.

The marking pattern may be a highlighting pattern. In the embodiment of the disclosure, the marking pattern may be determined according to the identification code of the target word segmentation entity by using a script programming language (e.g., JavaScript) to dynamically insert a programming statement containing the marking pattern into the target word segmentation entity.

Specifically, the marking pattern may be determined according to the identification code of the target word segmentation entity as follows: converting the identification code of the target word segmentation entity into an identification code with a set format; and analyzing the identification code with the set format to obtain the marking pattern.

The set format may be a cascading style. Specifically, the identification code of the target word segmentation entity is converted into a cascading-style identification code, which is then analyzed by means of the event's actual trigger element currentTarget to obtain the marking pattern. In the embodiment of the disclosure, the program may be implemented as follows: the identification code of the target word segmentation entity is converted into a cascading-style class name and passed to the leaf node, and an event function is called; after the event function has parsed the cascading-style class name, a programming statement containing the marking pattern is dynamically inserted using a script programming language, so as to mark the word segmentation area and the target word segmentation entity.
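
The conversion and parsing steps can be sketched as follows. The class-name prefix `seg-entity-` and the functions `idToClassName`, `classNameToId`, and `handleClick` are assumptions made for this example; the event is simulated with a plain object so that the sketch runs outside a browser, where a real click handler would receive a DOM event whose `currentTarget` is the element the handler is attached to.

```javascript
// Hypothetical prefix used to recognize entity class names.
const CLASS_PREFIX = "seg-entity-";

// Convert an identification code into a class name.
// Characters outside a conservative safe set are replaced with "_".
function idToClassName(id) {
  return CLASS_PREFIX + String(id).replace(/[^A-Za-z0-9_-]/g, "_");
}

// Recover the identification code from an element's class attribute.
function classNameToId(className) {
  const token = className.split(/\s+/).find((c) => c.startsWith(CLASS_PREFIX));
  return token ? token.slice(CLASS_PREFIX.length) : null;
}

// Event function: read the class name from the event's currentTarget.
function handleClick(event) {
  return classNameToId(event.currentTarget.className);
}

// Simulated event object standing in for a DOM click event.
const fakeEvent = { currentTarget: { className: "leaf " + idToClassName("e1") } };
console.log(handleClick(fakeEvent)); // e1
```

Note that the replacement in `idToClassName` is lossy for identification codes containing other characters, so this sketch only round-trips codes made of letters, digits, "_" and "-".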

Step 140, marking the word segmentation area and the target word segmentation entity according to the marking pattern.

Specifically, after the marking pattern is determined, the word segmentation area and the target word segmentation entity are marked according to the marking pattern. For example, the word segmentation area and the target word segmentation entity are highlighted so that the word segmentation area stands out.
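
One way to apply a highlighting pattern dynamically is to insert a style rule for the entity's class name. The sketch below is illustrative only: the function names, the default color, and the class name `seg-entity-e1` are assumptions, and a small stand-in object replaces the browser `document` so that the code also runs outside a browser.

```javascript
// Build the text of a highlight rule for a given class name.
function buildHighlightRule(className, color = "yellow") {
  return `.${className} { background-color: ${color}; }`;
}

// Insert the rule through a document-like object.
// In a browser, the global `document` can be passed as `doc`.
function insertHighlight(doc, className, color) {
  const style = doc.createElement("style");
  style.textContent = buildHighlightRule(className, color);
  doc.head.appendChild(style);
  return style;
}

// Minimal stand-in for the DOM, used only to make the sketch runnable here.
const fakeDocument = {
  head: {
    children: [],
    appendChild(node) {
      this.children.push(node);
      return node;
    },
  },
  createElement(tagName) {
    return { tagName, textContent: "" };
  },
};

const inserted = insertHighlight(fakeDocument, "seg-entity-e1");
console.log(inserted.textContent); // .seg-entity-e1 { background-color: yellow; }
```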

According to the technical solution of this embodiment, at least one word segmentation entity carrying an identification code is first created in a word segmentation area by using a set rule; the identification code of the target word segmentation entity corresponding to the word segmentation area is then acquired according to a triggering operation of a user; a marking pattern is determined according to the identification code of the target word segmentation entity; and finally the word segmentation area and the target word segmentation entity are marked according to the marking pattern. Because each word segmentation entity created in the word segmentation area carries an identification code, and the area and the target entity are marked with the determined marking pattern when the user triggers the area, multiple word segmentation entities can be created in the same area of the rich text, which improves the reliability of word segmentation.

Example 2

FIG. 2 is a schematic structural diagram of a word segmentation device according to a second embodiment of the disclosure. As shown in FIG. 2, the apparatus includes: a word segmentation entity creation module 210, a word segmentation entity identification code acquisition module 220, a marking pattern determining module 230 and a marking module 240.

The word segmentation entity creation module 210 is configured to create at least one word segmentation entity in the word segmentation area; the word segmentation entity carries an identification code;

the word segmentation entity identification code acquisition module 220 is configured to acquire the identification code of a target word segmentation entity corresponding to the word segmentation area according to a triggering operation of a user;

a marking pattern determining module 230, configured to determine a marking pattern according to the identification code of the target word segmentation entity;

the marking module 240 is configured to mark the word segmentation area and the target word segmentation entity according to the marking style.

Optionally, the word segmentation entity creation module 210 is further configured to:

creating at least one initial word segmentation entity in the word segmentation area;

and adding annotation information in the initial word segmentation entity to obtain a word segmentation entity, wherein the annotation information is annotation on the text in the word segmentation area.

Optionally, the word segmentation entity identification code acquisition module 220 is further configured to:

acquiring at least one word segmentation entity corresponding to the word segmentation area according to the triggering operation of the user;

and determining the word segmentation entity ranked first among the at least one word segmentation entity as the target word segmentation entity, and acquiring the identification code of the target word segmentation entity.

Optionally, the marking pattern determining module 230 is further configured to:

converting the identification code of the target word segmentation entity into an identification code with a set format;

and analyzing the identification code with the set format to obtain the marking pattern.

Optionally, converting the identification code of the target word segmentation entity into the identification code with the set format comprises:

converting the identification code of the target word segmentation entity into an identification code of the cascading style;

and analyzing the identification code with the set format to obtain the marking pattern comprises:

analyzing the identification code of the cascading style by means of the event's actual trigger element currentTarget to obtain the marking pattern.

Optionally, the marking pattern comprises a highlighting pattern.

Optionally, the word segmentation entity creation module 210 is further configured to:

create at least one word segmentation entity in the word segmentation area using an inline style (inlineStyleRanges).

The device can execute the method provided by all the embodiments of the disclosure, and has the corresponding functional modules and beneficial effects of executing the method. Technical details not described in detail in the embodiments of the present disclosure can be found in the methods provided by all of the foregoing embodiments of the present disclosure.
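
To show how the four modules could fit together, the sketch below maps each module (210, 220, 230, 240) onto one method of a single class. The class `WordSegmenter`, its method names, the creation counter used in place of a timestamp, and the fixed highlight style are assumptions made for illustration, not the structure prescribed by the disclosure.

```javascript
class WordSegmenter {
  constructor() {
    this.entities = [];
    this.counter = 0; // stands in for creation time
  }

  // Module 210: create an entity that carries an identification code.
  createEntity(offset, length, id) {
    const entity = { id, offset, length, order: this.counter++ };
    this.entities.push(entity);
    return entity;
  }

  // Module 220: identification code of the first-created entity at the index.
  getTargetId(clickedIndex) {
    const hits = this.entities
      .filter((e) => clickedIndex >= e.offset && clickedIndex < e.offset + e.length)
      .sort((a, b) => a.order - b.order);
    return hits.length > 0 ? hits[0].id : null;
  }

  // Module 230: derive a marking pattern from the identification code.
  determineStyle(id) {
    return { className: "seg-entity-" + id, css: "background-color: yellow;" };
  }

  // Module 240: describe what should be marked, and with which pattern.
  mark(clickedIndex) {
    const id = this.getTargetId(clickedIndex);
    if (id === null) return null;
    const entity = this.entities.find((e) => e.id === id);
    return {
      id,
      offset: entity.offset,
      length: entity.length,
      style: this.determineStyle(id),
    };
  }
}

const segmenter = new WordSegmenter();
segmenter.createEntity(0, 9, "e1");
segmenter.createEntity(5, 11, "e2");

const marked = segmenter.mark(6);
console.log(marked.id, marked.style.className); // e1 seg-entity-e1
```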

Example 3

Referring now to fig. 3, a schematic diagram of an electronic device 300 suitable for use in implementing embodiments of the present disclosure is shown. The electronic devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), car terminals (e.g., car navigation terminals), etc., as well as fixed terminals such as digital TVs, desktop computers, etc., or various forms of servers such as stand-alone servers or server clusters. The electronic device shown in fig. 3 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.

As shown in FIG. 3, the electronic device 300 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 301 that may perform various suitable actions and processes in accordance with programs stored in a read-only memory (ROM) 302 or programs loaded from a storage means 308 into a random access memory (RAM) 303. The RAM 303 also stores various programs and data required for the operation of the electronic device 300. The processing means 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to the bus 304.

In general, the following devices may be connected to the I/O interface 305: input devices 306 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 307 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 308 including, for example, magnetic tape, hard disk, etc.; and communication means 309. The communication means 309 may allow the electronic device 300 to communicate with other devices wirelessly or by wire to exchange data. While fig. 3 shows an electronic device 300 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.

In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program containing program code for performing the word segmentation method. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 309, or installed from the storage means 308, or installed from the ROM 302. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing means 301.

It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.

In some implementations, clients and servers may communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.

The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.

The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: creating at least one word segmentation entity in the word segmentation area by adopting a set rule; the word segmentation entity carries an identification code; acquiring an identification code of a target word segmentation entity corresponding to the word segmentation area according to triggering operation of a user; determining a marking pattern according to the identification code of the target word segmentation entity; and marking the word segmentation area and the target word segmentation entity according to the marking style.

Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. Wherein the names of the units do not constitute a limitation of the units themselves in some cases.

The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

According to one or more embodiments of the present disclosure, the embodiments of the present disclosure provide a word segmentation method, including:

creating at least one word segmentation entity in the word segmentation area; the word segmentation entity carries an identification code;

Further, creating at least one word segmentation entity in the word segmentation area, comprising:

creating at least one initial word segmentation entity in the word segmentation area by adopting a set rule;

Further, acquiring the identification code of the target word segmentation entity corresponding to the word segmentation area according to the triggering operation of the user, including:

acquiring at least one word segmentation entity corresponding to the word segmentation area according to triggering operation of a user;

and determining the word segmentation entity ranked first among the at least one word segmentation entity as the target word segmentation entity, and acquiring the identification code of the target word segmentation entity.

Further, determining a marking pattern according to the identification code of the target word segmentation entity comprises:

converting the identification code of the target word segmentation entity into an identification code with a set format;

and analyzing the identification code with the set format to obtain a marking pattern.

Further, converting the identification code of the target word segmentation entity into the identification code with a set format comprises the following steps:

converting the identification code of the target word segmentation entity into an identification code of a cascading style;

and analyzing the identification code with the set format to obtain a marking pattern comprises:

analyzing the identification code of the cascading style by means of an event actual trigger element to obtain the marking pattern.

Further, the marking pattern includes a highlighting pattern.

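Further, creating at least one word segmentation entity in the word segmentation area using the set rule includes:

creating at least one word segmentation entity in the word segmentation area using an inline style.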

Note that the above is only a preferred embodiment of the present disclosure and the technical principle applied. Those skilled in the art will appreciate that the present disclosure is not limited to the specific embodiments described herein, and that various obvious changes, rearrangements and substitutions can be made by those skilled in the art without departing from the scope of the disclosure. Therefore, while the present disclosure has been described in connection with the above embodiments, the present disclosure is not limited to the above embodiments, but may include many other equivalent embodiments without departing from the spirit of the present disclosure, the scope of which is determined by the scope of the appended claims.

Claims

1. A method of word segmentation, comprising:

analyzing the identification code of the cascading style by adopting an event actual trigger element to obtain a marking style;

2. The method of claim 1, wherein creating at least one word segmentation entity in the word segmentation area comprises:

3. The method of claim 1, wherein obtaining the identification code of the target word segmentation entity corresponding to the word segmentation region according to the triggering operation of the user comprises:

4. A method according to any one of claims 1-3, wherein the marking pattern comprises a highlighting pattern.

5. The method of claim 1, wherein creating at least one word segmentation entity in the word segmentation area using the set rule comprises:

at least one word segmentation entity is created in the word segmentation area using an inline style.

6. A word segmentation apparatus, comprising:

the word segmentation entity creation module is used for creating at least one word segmentation entity in the word segmentation area; the word segmentation entity carries an identification code;

the marking module is used for marking the word segmentation area and the target word segmentation entity according to the marking pattern;

the marking pattern determining module is specifically configured to:

and analyzing the identification code of the cascading style by adopting an event actual trigger element to obtain a marking style.

7. An electronic device, the electronic device comprising:

one or more processing devices;

a storage means for storing one or more programs;

when the one or more programs are executed by the one or more processing devices, the one or more processing devices are caused to implement the method of word segmentation as claimed in any one of claims 1-5.

8. A computer readable medium, on which a computer program is stored, characterized in that the program, when being executed by a processing device, implements a word segmentation method according to any one of claims 1-5.