CN115859975B

CN115859975B - Data processing method, device and equipment

Info

Publication number: CN115859975B
Application number: CN202310104834.6A
Authority: CN
Inventors: 吴晓烽; 王昊天; 王维强
Original assignee: Alipay Hangzhou Information Technology Co Ltd
Current assignee: Alipay Hangzhou Information Technology Co Ltd
Priority date: 2023-02-07
Filing date: 2023-02-07
Publication date: 2023-05-09
Anticipated expiration: 2043-02-07
Also published as: CN115859975A

Abstract

The embodiment of the specification provides a data processing method, a device and equipment, wherein the method comprises the following steps: acquiring a first word segmentation result corresponding to target data, wherein the target data is data generated in a human-computer interaction process; determining an information value corresponding to each word segmentation based on a plurality of characters included in each word segmentation in the first word segmentation result, wherein the information value corresponding to each word segmentation is used for representing the association strength among the plurality of characters included in each word segmentation; correcting the first word segmentation result based on the information value corresponding to each word segmentation and the phonetic symbol association relation between the word segmentation to obtain a second word segmentation result; and carrying out risk detection processing on the target data based on the second word segmentation result.

Description

Data processing method, device and equipment

Technical Field

The present disclosure relates to the field of data processing technologies, and in particular, to a data processing method, apparatus, and device.

Background

With the rapid development of the internet industry, the number and variety of network services keep growing, and so do network risks; how to provide users with a safer network environment has become a focus of network service providers. In a risk control scenario, service data can be run through a pre-trained risk detection model to obtain a risk detection result, and whether executing a given service carries risk is determined based on that result.

However, because service data is large in volume and complex in structure, performing detection directly on the service data yields low risk detection efficiency and accuracy. A solution that improves the efficiency and accuracy of risk detection in a risk control scenario is therefore needed.

Disclosure of Invention

The embodiments of the present specification aim to provide a data processing method, device and equipment that improve the efficiency and accuracy of risk detection in a risk control scenario.

In order to achieve the above technical solution, the embodiments of the present specification are implemented as follows:

In a first aspect, embodiments of the present specification provide a data processing method, including: acquiring a first word segmentation result corresponding to target data, wherein the target data is data generated in a human-computer interaction process; determining an information value corresponding to each word segmentation based on a plurality of characters included in each word segmentation in the first word segmentation result, wherein the information value corresponding to each word segmentation is used for representing the association strength among the plurality of characters included in each word segmentation; correcting the first word segmentation result based on the information value corresponding to each word segmentation and the phonetic symbol association relation between the word segmentation to obtain a second word segmentation result; and carrying out risk detection processing on the target data based on the second word segmentation result.

In a second aspect, embodiments of the present disclosure provide a data processing apparatus, the apparatus comprising: the first acquisition module is used for acquiring a first word segmentation result corresponding to target data, wherein the target data is data generated in a human-computer interaction process; the first determining module is used for determining an information value corresponding to each word segmentation based on a plurality of characters included in each word segmentation in the first word segmentation result, wherein the information value corresponding to each word segmentation is used for representing the association strength among the plurality of characters included in each word segmentation; the result correction module is used for correcting the first word segmentation result based on the information value corresponding to each word segmentation and the phonetic symbol association relation between the word segmentation to obtain a second word segmentation result; and the risk detection module is used for carrying out risk detection processing on the target data based on the second word segmentation result.

In a third aspect, embodiments of the present specification provide a data processing apparatus, the data processing apparatus comprising: a processor; and a memory arranged to store computer executable instructions that, when executed, cause the processor to: acquiring a first word segmentation result corresponding to target data, wherein the target data is data generated in a human-computer interaction process; determining an information value corresponding to each word segmentation based on a plurality of characters included in each word segmentation in the first word segmentation result, wherein the information value corresponding to each word segmentation is used for representing the association strength among the plurality of characters included in each word segmentation; correcting the first word segmentation result based on the information value corresponding to each word segmentation and the phonetic symbol association relation between the word segmentation to obtain a second word segmentation result; and carrying out risk detection processing on the target data based on the second word segmentation result.

In a fourth aspect, embodiments of the present description provide a storage medium for storing computer-executable instructions that, when executed, implement the following: acquiring a first word segmentation result corresponding to target data, wherein the target data is data generated in a human-computer interaction process; determining an information value corresponding to each word segmentation based on a plurality of characters included in each word segmentation in the first word segmentation result, wherein the information value corresponding to each word segmentation is used for representing the association strength among the plurality of characters included in each word segmentation; correcting the first word segmentation result based on the information value corresponding to each word segmentation and the phonetic symbol association relation between the word segmentation to obtain a second word segmentation result; and carrying out risk detection processing on the target data based on the second word segmentation result.

Drawings

In order to more clearly illustrate the embodiments of the present description or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some of the embodiments described in the present description, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1A is a flowchart illustrating an embodiment of a data processing method according to the present disclosure;

FIG. 1B is a schematic diagram illustrating a data processing method according to the present disclosure;

FIG. 2 is a schematic diagram illustrating a processing procedure of another data processing method according to the present disclosure;

FIG. 3 is a schematic diagram of a word segmentation group according to the present disclosure;

FIG. 4 is a schematic diagram of another word segmentation group according to the present disclosure;

FIG. 5 is a schematic diagram of an embodiment of a data processing apparatus according to the present disclosure;

FIG. 6 is a schematic diagram of a data processing device according to the present specification.

Detailed Description

The embodiment of the specification provides a data processing method, a device and equipment.

In order to make the technical solutions in the present specification better understood by those skilled in the art, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only some embodiments of the present specification, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.

Example 1

As shown in fig. 1A and fig. 1B, the embodiment of the present disclosure provides a data processing method, where an execution body of the method may be a server, and the server may be an independent server or a server cluster formed by a plurality of servers. The method specifically comprises the following steps:

In S102, a first word segmentation result corresponding to the target data is obtained.

The target data may be data generated in a human-computer interaction process and may include data of different types, such as text data, picture data, and audio data. For example, the target data may be data required for executing a preset service. Specifically, the preset service may be a resource transfer service: when a user triggers execution of the resource transfer service, the service data related to executing it (such as the resource transfer time and the resource transfer quantity) may be determined as the target data. The first word segmentation result corresponding to the target data may be the result of performing word segmentation processing on the target data based on a preset word segmentation algorithm. The preset word segmentation algorithm may include the forward maximum matching method (FMM), the backward maximum matching method (BMM), the bidirectional scanning method, word segmentation algorithms based on statistical models (such as HMM and n-gram), word segmentation algorithms combining rule-based and statistical methods, and so on. The embodiments of the present specification do not specifically limit how the first word segmentation result is determined.
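As an illustration of one of the preset word segmentation algorithms named above, the following is a minimal sketch of forward maximum matching. The vocabulary `VOCAB`, the `max_len` parameter and the sample input are hypothetical and chosen only for this example; the specification does not prescribe any particular implementation.

```python
def forward_max_match(text, vocab, max_len=4):
    """Segment text greedily, always taking the longest vocabulary match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Toy vocabulary (hypothetical, for illustration only).
VOCAB = {"risk", "detect", "detection", "model"}
first_result = forward_max_match("riskdetectionmodel", VOCAB, max_len=9)
```

Because the longest candidate is tried first, "detection" is preferred over the shorter entry "detect". The backward variant would scan from the end of the text instead.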

In implementation, with the rapid development of the internet industry, the number and variety of network services keep growing, and so do network risks; how to provide users with a safer network environment has become a focus of network service providers. In a risk control scenario, service data can be run through a pre-trained risk detection model to obtain a risk detection result, and whether executing a given service carries risk is determined based on that result. However, because service data is large in volume and complex in structure, performing detection directly on the service data yields low risk detection efficiency and accuracy, so a solution that improves the efficiency and accuracy of risk detection in a risk control scenario is needed. For this reason, the embodiments of the present specification provide a technical solution that can solve the above problems, as described below.

In implementation, the terminal device may acquire data generated by a user in a man-machine interaction process, send the acquired data as target data to the server, and after receiving the target data, the server may perform word segmentation on the target data through a preset word segmentation algorithm to obtain a first word segmentation result corresponding to the target data. Or after the terminal equipment acquires the target data, the target data can be subjected to word segmentation processing through a preset word segmentation algorithm to obtain a first word segmentation result corresponding to the target data, and the first word segmentation result corresponding to the target data is sent to the server.

In addition, in the case that the target data includes non-text data (such as audio data, picture data, etc.), the terminal device may send the target data to the server, where the server performs text conversion processing on the non-text data in the target data based on a preset text conversion algorithm, and performs word segmentation processing on the target data after the text conversion processing to obtain a first word segmentation result corresponding to the target data. Or, the terminal device may perform text conversion processing on non-text data in the target data based on a preset text conversion algorithm, send the target data after the text conversion processing to the server, and perform word segmentation processing on the target data after the text conversion processing by the server to obtain a first word segmentation result corresponding to the target data.

Specifically, for example, assuming that the target data is page data of a target page to be detected, that is, the target data may include picture data, audio data and text data in the target page, the terminal device may send the obtained target data to a server for risk detection processing, and the server may perform text conversion processing on non-text data in the target data to obtain the target data after the text conversion processing. For example, the text conversion processing can be performed on the audio data in the target data through an automatic speech recognition technology (Automatic Speech Recognition, ASR) to obtain text data corresponding to the audio data, and meanwhile, the server can also extract the text data in the picture data through a text extraction algorithm to obtain the target data after the text conversion processing.

In S104, based on the plurality of characters included in each word in the first word segmentation result, an information value corresponding to each word is determined.

Wherein, the information value corresponding to the word segment can be used for representing the association strength among a plurality of characters included in each word segment.

In implementation, the first word segmentation result may be input into a pre-trained information value determination model to determine the information value corresponding to each word segment in the first word segmentation result. The information value determination model may be obtained by training a model, constructed with a preset machine learning algorithm, on historical word segments.

Since the target data may include data obtained by text conversion processing, it may contain text recognition errors. For example, the target data may include text converted from audio data, and, because of problems such as accents, the conversion may introduce recognition errors such as converting "segmentation" in the audio data into a different word with the same or a similar pronunciation.

Therefore, the association strength between the plurality of characters included in each word segment (i.e., the information value corresponding to each word segment) may be determined based on the plurality of characters included in each word segment, so as to perform correction processing on the first word segment result based on the information value corresponding to the word segment.

For example, assuming that the target data is "the word segmentation result obtained by the multiple processing", the first word segmentation result corresponding to the target data may include the word segments "segmentation", "processing", "obtaining", "word segmentation processing" and "result". These word segments may be input into a pre-trained information value determination model to obtain the information value corresponding to each word segment.

The information value corresponding to each word segment may be used to represent the association strength between the characters included in that word segment. For example, the information value corresponding to "segmentation" represents the association strength between its two characters. The information value corresponding to "word segmentation processing" may represent the association strength between "word segmentation" and "processing", or alternatively the association strength among its four individual characters.

The above method for determining the information value corresponding to a word segment is one optional implementation. In practice, other determination methods may be used depending on the application scenario, and the embodiments of the present specification do not specifically limit this.

In S106, the first word segmentation result is modified based on the information value corresponding to each word segmentation and the phonetic symbol association relationship between the words, so as to obtain a second word segmentation result.

The phonetic symbol association relationship is an association relationship between word segments that is determined based on the phonetic symbols of the word segments. The phonetic symbols may be Chinese pinyin, English phonetic symbols, the International Phonetic Alphabet, or any other symbols that record phonemes.

In implementation, the phonetic symbols corresponding to each word segment may be obtained, and the phonetic symbol association relationship between the word segments may be determined based on those phonetic symbols. For example, if the phonetic symbols of several word segments are the same, it may be determined that a phonetic symbol association relationship exists between them. Specifically, if the phonetic symbols of "segmentation" and "processing" taken together are the same as the phonetic symbols of "word segmentation processing", it may be determined that a phonetic symbol association relationship exists among these three word segments.
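The simplest case described above, word segments whose phonetic symbols are identical, can be sketched as a grouping step. The pronunciation table `PHONETICS` and the English sample words are hypothetical stand-ins; a real system would take phonetic symbols from a pronunciation dictionary or a pinyin converter, which the specification leaves open.

```python
from collections import defaultdict

# Hypothetical toy pronunciation table (ARPAbet-like strings).
PHONETICS = {
    "there": "DH EH R",
    "their": "DH EH R",
    "risk": "R IH S K",
}


def group_by_phonetics(segments, phonetics=PHONETICS):
    """Return groups of distinct segments that share the same phonetic symbols."""
    buckets = defaultdict(list)
    for seg in segments:
        key = phonetics.get(seg)
        # Segments without a known pronunciation are skipped.
        if key is not None and seg not in buckets[key]:
            buckets[key].append(seg)
    # Only buckets with at least two members express an association.
    return [group for group in buckets.values() if len(group) > 1]


related = group_by_phonetics(["there", "risk", "their"])
```

Here `related` holds one group containing "there" and "their", while "risk" has no phonetic partner and is left out.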

The first word segmentation result can then be corrected based on the information values corresponding to the word segments to obtain the second word segmentation result. For example, the word segments having a phonetic symbol association relationship, together with their corresponding information values, may be input into a pre-trained correction model to obtain a correction result, and the second word segmentation result may be obtained based on the first word segmentation result and the correction result. The correction result may include, for example, "split processing". The correction model may be constructed based on a preset machine learning algorithm and is used for correcting a word segmentation result based on the phonetic symbol association relationship and the information values.

The above method for determining the second word segmentation result is one optional implementation. In practice, other determination methods may be selected depending on the application scenario, and the embodiments of the present specification do not specifically limit this.

In S108, risk detection processing is performed on the target data based on the second word segmentation result.

In implementation, different detection methods can be selected according to different application scenes corresponding to the target data, and risk detection processing is performed on the target data based on the second word segmentation result.

For example, the detection method may be manual detection, model detection, keyword matching detection, and so on, and the appropriate detection method may be selected according to the application scenario corresponding to the target data. For instance, assuming that the target data is resource transfer data, the corresponding application scenario is a resource transfer data detection scenario, and the model detection method may be selected to perform risk detection processing on the target data.
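Of the detection methods named above, keyword matching is the easiest to illustrate. The sketch below checks a corrected word segmentation result against a keyword set. The keyword list `RISK_KEYWORDS` and the shape of the returned dictionary are assumptions made for this example, not part of the specification.

```python
# Hypothetical risk keywords, for illustration only.
RISK_KEYWORDS = {"refund scam", "fake invoice"}


def keyword_risk_check(segments, keywords=RISK_KEYWORDS):
    """Flag the target data as risky if any word segment is a risk keyword."""
    hits = [seg for seg in segments if seg in keywords]
    return {"risky": bool(hits), "hits": hits}


report = keyword_risk_check(["please", "pay", "fake invoice"])
```

Matching operates on whole word segments, which is why an accurate second word segmentation result matters: a keyword split across two wrong segments would not be found.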

The embodiment of the specification provides a data processing method, which is used for acquiring a first word segmentation result corresponding to target data, wherein the target data is data generated in a human-computer interaction process, determining an information value corresponding to each word segmentation based on a plurality of characters included in each word segmentation in the first word segmentation result, wherein the information value corresponding to each word segmentation is used for representing the association strength among the plurality of characters included in each word segmentation, correcting the first word segmentation result based on the information value corresponding to each word segmentation and phonetic symbol association relation among the words, obtaining a second word segmentation result, and performing risk detection processing on the target data based on the second word segmentation result. Therefore, the second word segmentation result is obtained by correcting the first word segmentation result through the phonetic symbol association relation and the information value, so that the word segmentation accuracy of the target data can be improved, and the efficiency and the accuracy of risk detection processing of the target data can be improved under the conditions of large data volume and complex data structure of the target data.

Example 2

As shown in fig. 2, the embodiment of the present disclosure provides a data processing method, where an execution body of the method may be a server, where the server may be an independent server or may be a server cluster formed by a plurality of servers. The method specifically comprises the following steps:

The target data may be data generated in a human-computer interaction process.

In S202, a plurality of second word segments having phonetic symbol association relation with the first word segment in the first word segment result is obtained.

The first word segment may be any word segment in the first word segmentation result.

In implementation, the phonetic symbols of each word segment in the first word segmentation result may be obtained, and a plurality of second word segments having a phonetic symbol association relationship with the first word segment may be determined based on those phonetic symbols. For example, the word segments in the first word segmentation result may be selected in turn as the first word segment until all word segments in the first word segmentation result have been screened, yielding the plurality of second word segments having a phonetic symbol association relationship with the first word segment. Specifically, assume that the first word segmentation result includes the word segments "segmentation", "processing", "obtaining", "word segmentation processing" and "result". The first of these, "segmentation", may be taken as the first word segment, and it is determined whether the first word segmentation result contains a plurality of word segments whose phonetic symbols are the same as those of "segmentation". If not, the second one, "processing", is taken as the first word segment, and it is determined whether the first word segmentation result contains a plurality of word segments whose phonetic symbols are the same as those of "processing". This continues until every word segment in the first word segmentation result has been screened. For example, the first word segment may be "word segmentation processing" and the second word segments may include "segmentation" and "processing", that is, the phonetic symbols of "word segmentation processing" are identical to the phonetic symbols of "segmentation" and "processing" combined.
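The screening described above can be sketched as a search for other word segments whose combined phonetic symbols equal those of the first word segment. The pronunciation table, the English sample words, and the restriction to contiguous runs of segments are all assumptions made for this illustration; the specification does not state how candidates are enumerated.

```python
# Hypothetical toy pronunciation table (ARPAbet-like strings).
PHONETICS = {
    "icecream": "AY S K R IY M",
    "i": "AY",
    "scream": "S K R IY M",
    "melts": "M EH L T S",
}


def find_second_segments(first, segments, phonetics=PHONETICS):
    """Find a contiguous run of other segments whose joined phonetic
    symbols equal those of `first`; return [] if there is none."""
    target = phonetics[first].split()
    for start in range(len(segments)):
        joined = []
        for end in range(start, len(segments)):
            seg = segments[end]
            if seg == first or seg not in phonetics:
                break
            joined += phonetics[seg].split()
            # A plurality of second segments is required, hence end > start.
            if joined == target and end > start:
                return segments[start:end + 1]
            if len(joined) >= len(target):
                break
    return []


segments = ["i", "scream", "melts", "icecream"]
second = find_second_segments("icecream", segments)
```

In this toy input, "i" and "scream" together sound like "icecream", so they are returned as the second word segments. Calling the function with each segment in turn mirrors the sequential screening in S202.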

In S204, when the information value corresponding to the first word is greater than the information value corresponding to each of the plurality of second words, the plurality of second words are combined to obtain a third word.

Wherein the information value may be a pointwise mutual information (PMI) value.
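The specification names PMI without spelling out its formula. For reference, the standard definition for two parts x and y of a word segment, with probabilities estimated from a corpus, is given below. The symbols x, y and p are notation introduced here and do not appear in the source.

```latex
\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}
```

A positive value means that x and y occur together more often than they would if they were independent, which is why a larger PMI can be read as a stronger association between the characters.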

In implementation, the correlation (i.e., the association strength) between the characters in each word segment may be measured by the PMI: the larger the PMI, the greater the association strength between the characters included in the word segment. When the information value corresponding to the first word segment is greater than the information value corresponding to each of the plurality of second word segments, the plurality of second word segments are combined to obtain a third word segment. For example, assume that the first word segment is "word segmentation processing" and the corresponding second word segments are "segmentation" and "processing", and that the PMI of "word segmentation processing" is a, the PMI of "segmentation" is b, and the PMI of "processing" is c, where a, b and c are all greater than 0. When a is greater than both b and c, the degree of association between the characters in "word segmentation processing" may be considered greater than that between the characters in "segmentation" and that between the characters in "processing". Therefore, "segmentation" and "processing" may be combined to obtain the third word segment, that is, the third word segment may be "segmentation processing".
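The comparison in S204 can be sketched in two small functions: one estimating PMI from corpus counts, and one applying the merge rule. The count-based estimate, the function names, and joining the merged segments with a space are assumptions made for this illustration; the specification does not say how the PMI is computed or how the merged text is formed.

```python
import math


def pmi(pair_count, left_count, right_count, total):
    """Pointwise mutual information log(p(x, y) / (p(x) * p(y))) from raw counts."""
    p_xy = pair_count / total
    p_x = left_count / total
    p_y = right_count / total
    return math.log(p_xy / (p_x * p_y))


def merge_if_stronger(first_value, second_segments, second_values):
    """Merge the second segments when the first segment's information
    value exceeds every one of theirs; otherwise keep them unchanged."""
    if all(first_value > value for value in second_values):
        return [" ".join(second_segments)]
    return list(second_segments)


a = 3.0          # information value of the first word segment (made up)
b, c = 1.0, 2.0  # information values of the two second word segments (made up)
third = merge_if_stronger(a, ["segmentation", "processing"], [b, c])
```

With a greater than both b and c, the two second word segments are merged into one third word segment. If either b or c were at least as large as a, the segments would be left as they are.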

The above method for determining the third word segment is one optional method. In practical applications, S204 may be processed in various ways; an optional implementation is provided below, see steps A1 to A4:

In step A1, the characters included in the plurality of second word segments are recombined to obtain a plurality of word segment groups, each of which includes a plurality of sub word segments that together form the plurality of second word segments.

In implementation, for example, assuming that the second word segments are "segmentation" and "processing", they contain four characters in total. These four characters may be recombined to obtain a plurality of word segment groups, some of which are shown in fig. 3, and the sub word segments included in each word segment group together form the second word segments.

In step A2, based on a plurality of sub-word segments in the word segment group, an information value corresponding to the word segment group is determined.

The information values corresponding to the word segmentation groups are used for representing the association degree among a plurality of sub-word segments.

In implementation, the information value corresponding to a word segment group may be determined based on the sub word segments included in that group. For example, as shown in fig. 3, for word segment group 1, the pointwise mutual information value of the group may be determined by a PMI algorithm based on its four sub word segments "segmentation", "secondary", "place" and "process", and this value may be used as the information value of word segment group 1. That is, the information value of word segment group 1 characterizes the degree of association among the four sub word segments "segmentation", "secondary", "place" and "process".

In step A3, when the information value corresponding to the first word segment is greater than the information value corresponding to each of the plurality of word segment groups, the plurality of second word segments are combined to obtain a third word segment.

In step A4, when a target word segment group whose information value is not smaller than the information value corresponding to the first word segment exists among the plurality of word segment groups, the sub word segments included in the target word segment group are determined as the third word segments.

In implementation, assume that the first word segment is "word segmentation processing" and the corresponding second word segments are "segmentation" and "processing", so that the second word segments correspond to a plurality of word segment groups. If the information value corresponding to the first word segment is greater than the information value corresponding to each of the word segment groups, the plurality of second word segments may be combined to obtain the third word segment, that is, the third word segment may be "segmentation processing". If a target word segment group whose information value is not smaller than the information value corresponding to the first word segment exists among the word segment groups, the sub word segments included in the target word segment group may be determined as the third word segments. For example, as shown in fig. 3, assuming that the information value of word segment group 4 is 3 and the information value of the first word segment is 2, word segment group 4 may be determined as the target word segment group, and the sub word segments included in word segment group 4 may be determined as the third word segments, that is, the third word segments include "segmentation" and "segmentation processing". In addition, when there are a plurality of target word segment groups, the sub word segments included in the target word segment group with the largest information value may be determined as the third word segments.
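Steps A1 to A4 can be sketched as follows. Enumerating every contiguous split of the character sequence is one possible reading of "recombining" the characters, and the scoring function `group_value` is supplied by the caller because the specification does not fix how a group's information value is computed. Both, along with the placeholder characters "abcd", are assumptions of this illustration.

```python
from itertools import combinations


def segment_groups(chars):
    """Step A1: every way to split the character sequence into
    contiguous sub word segments."""
    n = len(chars)
    groups = []
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = [0, *cuts, n]
            groups.append([chars[a:b] for a, b in zip(bounds, bounds[1:])])
    return groups


def choose_third_segments(first_value, second_segments, group_value):
    """Steps A2 to A4, using a caller-supplied scoring function."""
    chars = "".join(second_segments)
    # Keep only groups with several sub word segments; the unsplit
    # sequence is the merged candidate itself.
    groups = [g for g in segment_groups(chars) if len(g) > 1]
    scored = [(group_value(g), g) for g in groups]  # step A2
    best_value, best_group = max(scored, key=lambda item: item[0])
    if best_value >= first_value:
        return best_group  # step A4: a target group exists
    return [chars]         # step A3: merge the second segments


def toy_value(group):
    # Made-up scores: one group is strongly associated, the rest weakly.
    return 3.0 if group == ["ab", "cd"] else 1.0


third = choose_third_segments(2.0, ["a", "bcd"], toy_value)
```

With a first-segment value of 2.0, the group scored 3.0 becomes the target group. Raising the first-segment value above every group score makes the function fall back to the merged segment. Taking the maximum also covers the case of several target groups, where the one with the largest information value is chosen.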

In practical applications, S204 may be processed in various ways; another optional implementation is provided below, see steps B1 to B4:

In step B1, based on the plurality of sub word segments in the word segment groups, the information value corresponding to each sub word segment is determined.

The information value corresponding to each sub-word can be used for representing the association strength among a plurality of characters included in the sub-word.

In implementation, the information value corresponding to each sub word segment is determined through a PMI algorithm, so that the association strength among the characters included in the sub word segment is represented by the information value of that sub word segment. For example, as shown in fig. 4, one sub word segment may have an information value of 0.1 and another an information value of 3, meaning that the association strength between the characters of the latter is greater than that between the characters of the former.

In step B2, when the information value corresponding to the first word segment is greater than the information value corresponding to each sub word segment in the word segment groups, the plurality of second word segments are combined to obtain a third word segment.

In step B3, when a target sub word segment whose information value is greater than the information value corresponding to the first word segment exists among the plurality of sub word segments, the word segment group corresponding to the target sub word segment is obtained.

In step B4, the sub word segments included in the word segment group corresponding to the target sub word segment are determined as the third word segments.

In the implementation, assume that the first word segment is "word segmentation processing" and that the corresponding second word segments are "segmentation" and "processing". If the information value corresponding to the first word segment is greater than the information value corresponding to each sub-word segment in the multiple word segment groups, the plurality of second word segments may be combined to obtain a third word segment, that is, the third word segment may be "segmentation processing".

If a target sub-word segment whose information value is greater than the information value corresponding to the first word segment exists among the plurality of sub-word segments, the word segment group corresponding to the target sub-word segment can be obtained, and the sub-word segments included in that word segment group are determined to be the third word segment. For example, as shown in fig. 4, assuming that the information value of the first word segment is 2 and the information value of "segmentation" is 3, which is greater than the information value of the first word segment, "segmentation" may be determined as the target sub-word segment, and the word segment group corresponding to "segmentation" may be acquired. Where there is one word segment group corresponding to the target sub-word segment, the sub-word segments included in that group may be determined as the third word segment; where there are a plurality of such groups, the third word segment may be determined based on the information values of the sub-word segments included in each of those groups. Specifically, as shown in fig. 4, the word segment groups corresponding to "segmentation" include word segment group 5 and word segment group 6, and the information value corresponding to "process" in word segment group 6 is greater than the information value corresponding to "place" and the information value corresponding to "process" in word segment group 5, so the sub-word segments included in word segment group 6 can be determined as the third word segment, that is, the third word segment includes "segmentation" and "process".
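Steps B1 to B4 can be sketched roughly as below. This is only an illustration under assumptions: the function name `third_words_by_subword` is hypothetical, `value_of` stands in for the per-sub-word information value, and when several groups contain the target sub-word segment the sketch compares the groups by the best information value among their remaining sub-word segments, which is one possible reading of the fig. 4 example rather than a rule stated in the text.

```python
from typing import Callable, List, Sequence, Tuple

Group = Tuple[str, ...]


def third_words_by_subword(
    first_value: float,
    second_words: Sequence[str],
    groups: Sequence[Group],
    value_of: Callable[[str], float],  # hypothetical per-sub-word score (B1)
) -> List[str]:
    """Pick the third word segment(s) using sub-word information values."""
    subwords = sorted({w for g in groups for w in g})
    # B3: sub-words whose information value exceeds that of the first word.
    targets = [w for w in subwords if value_of(w) > first_value]
    if not targets:
        # B2: the first word outscores the sub-words, so merge the second
        # words (equal values are treated the same way; an assumption).
        return ["".join(second_words)]
    target = max(targets, key=value_of)
    containing = [g for g in groups if target in g]
    if len(containing) == 1:
        # B4 with a single group: use its sub-words directly.
        return list(containing[0])

    def rest_score(group: Group) -> float:
        # Assumed tie-break: best value among the non-target sub-words.
        rest = [value_of(w) for w in group if w != target]
        return max(rest) if rest else float("-inf")

    return list(max(containing, key=rest_score))
```

With two groups that both contain the target sub-word segment, the group whose other sub-word segments score higher is returned.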

In S206, a second word segmentation result is determined based on the first word segmentation and the third word segmentation.

In an implementation, assume that the first word segmentation result includes "segmentation", "processing", "obtaining", "word segmentation processing", and "result", that the first word segment is "word segmentation processing", and that the third word segment is "segmentation processing". The second word segmentation result may then include "segmentation processing", "obtaining", "word segmentation processing", and "result".
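One way to assemble the second word segmentation result is to replace the second word segments with the third word segment inside the first result, as sketched below. The function name `apply_correction` is hypothetical, and the sketch assumes that the second word segments appear as one contiguous run in the first result, which the text does not state.

```python
from typing import List


def apply_correction(
    first_result: List[str],
    second_words: List[str],
    third_words: List[str],
) -> List[str]:
    """Replace the first contiguous run of second_words with third_words."""
    n = len(second_words)
    if n == 0:
        return list(first_result)
    for i in range(len(first_result) - n + 1):
        if first_result[i:i + n] == second_words:
            return first_result[:i] + third_words + first_result[i + n:]
    return list(first_result)  # run not found: leave the result unchanged


first = ["segmentation", "processing", "obtaining",
         "word segmentation processing", "result"]
second = apply_correction(
    first, ["segmentation", "processing"], ["segmentation processing"]
)
```

Here `second` keeps the first word segment "word segmentation processing" in place and carries the merged third word segment in front of "obtaining".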

In S208, the second word segmentation result is input into a pre-trained intention recognition model to obtain a target intention corresponding to the target data.

The intention recognition model may be a model for performing an intention recognition process constructed based on a preset machine learning algorithm.

In the implementation, the second word segmentation result can be input into a pre-trained intention recognition model to obtain the target intention corresponding to the target data. Because the second word segmentation result is obtained by correcting the first word segmentation result based on the phonetic symbol association relationship, determining the target intention from the second word segmentation result can improve the accuracy of intention recognition.

In S210, a risk detection result for the target data is determined based on the target intention.

In implementation, taking as an example input data that a user feeds back to an intelligent question-answering system during human-computer interaction, the target intention corresponding to the target data can be obtained based on the second word segmentation result corresponding to the target data and a pre-trained intention recognition model, and a risk detection result for the target data is then determined based on the target intention. For example, if the target intention corresponding to the target data is a resource transfer intention, the risk detection result for the target data may be high risk, and the target data may be sent to the corresponding data processing party for risk control processing.
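The mapping from target intention to risk detection result can be sketched as a simple lookup. Everything named here is an assumption made for illustration: the intent label `resource_transfer`, the set `HIGH_RISK_INTENTS`, the function `detect_risk`, and the keyword stub `keyword_intent_model`, which merely stands in for a pre-trained intention recognition model.

```python
from typing import Callable, Dict, List

# Hypothetical intent labels treated as high risk.
HIGH_RISK_INTENTS = {"resource_transfer"}


def detect_risk(
    tokens: List[str],
    intent_model: Callable[[List[str]], str],
) -> Dict[str, object]:
    """Run intention recognition, then map the intention to a risk level."""
    intent = intent_model(tokens)
    level = "high" if intent in HIGH_RISK_INTENTS else "low"
    # High-risk data would be forwarded to a party for risk control handling.
    return {"intent": intent, "risk": level, "escalate": level == "high"}


def keyword_intent_model(tokens: List[str]) -> str:
    """Stub model used only for this sketch; not a trained classifier."""
    return "resource_transfer" if "transfer" in tokens else "other"
```

In practice the stub would be replaced by the trained model, and the lookup could carry more than two risk levels.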

Therefore, the first word segmentation result can be corrected through the phonetic symbol association relationship and the association strength of the characters to obtain a second word segmentation result, and the target intention corresponding to the target data can be accurately determined based on the second word segmentation result, so that the risk detection result of the target data can be timely and accurately determined based on the target intention, and the risk control efficiency and accuracy are improved.

Example III

Based on the same idea as the data processing method provided in the embodiments of the present disclosure, the embodiment of the present disclosure further provides a data processing apparatus, as shown in fig. 5.

The data processing apparatus includes: a first obtaining module 501, a first determining module 502, a result correcting module 503 and a risk detecting module 504, wherein:

the first obtaining module 501 is configured to obtain a first word segmentation result corresponding to target data, where the target data is data generated in a human-computer interaction process;

the first determining module 502 is configured to determine an information value corresponding to each word segment based on a plurality of characters included in each word segment in the first word segmentation result, where the information value corresponding to each word segment is used for representing the association strength among the plurality of characters included in that word segment;

the result correcting module 503 is configured to correct the first word segmentation result based on the information value corresponding to each word segment and the phonetic symbol association relation between the word segments, to obtain a second word segmentation result;

and the risk detecting module 504 is configured to perform risk detection processing on the target data based on the second word segmentation result.
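The four modules can be pictured as stages of a pipeline. The sketch below is a structural illustration only: the class `DataProcessingApparatus`, its field names, and the callable signatures are assumptions, with the reference numerals 501 to 504 noted in comments.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DataProcessingApparatus:
    # Each field stands in for one module; the signatures are assumptions.
    obtain: Callable[[str], List[str]]                            # 501
    determine: Callable[[List[str]], Dict[str, float]]            # 502
    correct: Callable[[List[str], Dict[str, float]], List[str]]   # 503
    detect: Callable[[List[str]], str]                            # 504

    def run(self, target_data: str) -> str:
        first_result = self.obtain(target_data)        # first segmentation
        values = self.determine(first_result)          # information values
        second_result = self.correct(first_result, values)
        return self.detect(second_result)              # risk detection


# Toy stand-ins so the pipeline can be exercised end to end.
apparatus = DataProcessingApparatus(
    obtain=str.split,
    determine=lambda words: {w: float(len(w)) for w in words},
    correct=lambda words, values: words,
    detect=lambda words: "high" if "transfer" in words else "low",
)
```

Calling `apparatus.run(...)` on a piece of interaction text passes it through the four stages in order.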

In the embodiment of the present disclosure, the risk detection module 504 is configured to:

Inputting the second word segmentation result into a pre-trained intention recognition model to obtain target intention corresponding to the target data, wherein the intention recognition model is a model constructed based on a preset machine learning algorithm and used for carrying out intention recognition processing;

based on the target intent, a risk detection result for the target data is determined.

In the embodiment of the present disclosure, the result correction module 503 is configured to:

acquiring a plurality of second word fragments with phonetic symbol association relation with a first word fragment in the first word fragment result, wherein the first word fragment is any word fragment in the first word fragment result;

combining the plurality of second words to obtain a third word segment under the condition that the information value corresponding to the first word segment is larger than the information value corresponding to each of the plurality of second word segments;

and determining the second word segmentation result based on the first word segmentation and the third word segmentation.

recombining characters included in the plurality of second word segments to obtain multi-component word groups, wherein each word group comprises a plurality of sub word segments for forming the plurality of second word segments;

Determining information values corresponding to the word segmentation groups based on a plurality of sub-word segments in the word segmentation groups, wherein the information values corresponding to the word segmentation groups are used for representing the association degree among the plurality of sub-word segments;

and combining the plurality of second words under the condition that the information value corresponding to the first word segmentation is larger than the information value corresponding to each group of the multi-component word groups, so as to obtain the third word segmentation.

In an embodiment of the present disclosure, the apparatus further includes:

the word segmentation determining module is used for determining the sub-word segments included in a target word segment group as the third word segment under the condition that the target word segment group, whose information value is not smaller than the information value corresponding to the first word segment, exists among the multiple word segment groups.

based on a plurality of sub-word fragments in the word fragment group, determining an information value corresponding to each sub-word fragment in the word fragment group, wherein the information value corresponding to each sub-word fragment is used for representing the association strength among a plurality of characters included in the sub-word fragment;

and combining the plurality of second word segments to obtain the third word segment when the information value corresponding to the first word segment is larger than the information value corresponding to each sub-word segment in the multiple word segment groups.

In an embodiment of the present disclosure, the apparatus further includes:

the second obtaining module is used for obtaining a word segmentation group corresponding to the target sub-word segment under the condition that the target sub-word segment with the information value larger than the information value corresponding to the first word segment exists in the plurality of sub-word segments;

and the second determining module is used for determining the sub word segment corresponding to the word segment group corresponding to the target sub word segment as the third word segment.

In this embodiment of the present disclosure, the information value is a pointwise mutual information (PMI) value.

The embodiment of the specification provides a data processing device, which is used for acquiring a first word segmentation result corresponding to target data, wherein the target data is data generated in a human-computer interaction process, determining an information value corresponding to each word segmentation based on a plurality of characters included in each word segmentation in the first word segmentation result, wherein the information value corresponding to each word segmentation is used for representing the association strength among a plurality of characters included in each word segmentation, correcting the first word segmentation result based on the information value corresponding to each word segmentation and phonetic symbol association relation among the words, obtaining a second word segmentation result, and performing risk detection processing on the target data based on the second word segmentation result. Therefore, the second word segmentation result is obtained by correcting the first word segmentation result through the phonetic symbol association relation and the information value, so that the word segmentation accuracy of the target data can be improved, and the efficiency and the accuracy of risk detection processing of the target data can be improved under the conditions of large data volume and complex data structure of the target data.

Example IV

Based on the same idea, the embodiment of the present disclosure further provides a data processing apparatus, as shown in fig. 6.

The data processing apparatus may vary considerably in configuration or performance and may include one or more processors 601 and memory 602, where the memory 602 may store one or more stored applications or data. Wherein the memory 602 may be transient storage or persistent storage. The application programs stored in the memory 602 may include one or more modules (not shown) each of which may include a series of computer executable instructions for use in a data processing apparatus. Still further, the processor 601 may be arranged to communicate with the memory 602 and execute a series of computer executable instructions in the memory 602 on a data processing apparatus. The data processing device may also include one or more power supplies 603, one or more wired or wireless network interfaces 604, one or more input/output interfaces 605, and one or more keyboards 606.

In particular, in this embodiment, the data processing apparatus includes a memory, and one or more programs, wherein the one or more programs are stored in the memory, and the one or more programs may include one or more modules, and each module may include a series of computer-executable instructions for the data processing apparatus, and the one or more programs configured to be executed by the one or more processors comprise instructions for:

Acquiring a first word segmentation result corresponding to target data, wherein the target data is data generated in a human-computer interaction process;

determining an information value corresponding to each word segmentation based on a plurality of characters included in each word segmentation in the first word segmentation result, wherein the information value corresponding to each word segmentation is used for representing the association strength among the plurality of characters included in each word segmentation;

correcting the first word segmentation result based on the information value corresponding to each word segmentation and the phonetic symbol association relation between the word segmentation to obtain a second word segmentation result;

and carrying out risk detection processing on the target data based on the second word segmentation result.

Optionally, the performing risk detection processing on the target data based on the second word segmentation result includes:

Optionally, the correcting the first word segmentation result based on the information value corresponding to each word segmentation and the phonetic symbol association relationship between the word segmentation to obtain a second word segmentation result includes:

Optionally, when the information value corresponding to the first word is greater than the information value corresponding to each second word in the plurality of second words, merging the plurality of second words to obtain a third word, including:

Optionally, the method further comprises:

and under the condition that a target word group with the information value not smaller than the information value corresponding to the first word segmentation exists in the multi-component word group, determining the sub word included in the target word group as the third word segmentation.

Optionally, when the information value corresponding to the first word segment is greater than the information value corresponding to each group of groups, combining the plurality of second word segments to obtain the third word segment, including:

Optionally, the method further comprises:

under the condition that target sub-word groups with information values larger than the information values corresponding to the first sub-word exist in the sub-word groups, obtaining word groups corresponding to the target sub-word groups;

And determining the sub-word corresponding to the word segmentation group corresponding to the target sub-word as the third word segmentation.

Optionally, the information value is a pointwise mutual information (PMI) value.

The embodiment of the specification provides data processing equipment, which is used for acquiring a first word segmentation result corresponding to target data, wherein the target data is data generated in a human-computer interaction process, determining an information value corresponding to each word segmentation based on a plurality of characters included in each word segmentation in the first word segmentation result, wherein the information value corresponding to each word segmentation is used for representing the association strength among the plurality of characters included in each word segmentation, correcting the first word segmentation result based on the information value corresponding to each word segmentation and phonetic symbol association relation among the words, obtaining a second word segmentation result, and performing risk detection processing on the target data based on the second word segmentation result. Therefore, the second word segmentation result is obtained by correcting the first word segmentation result through the phonetic symbol association relation and the information value, so that the word segmentation accuracy of the target data can be improved, and the efficiency and the accuracy of risk detection processing of the target data can be improved under the conditions of large data volume and complex data structure of the target data.

Example V

The embodiments of the present disclosure further provide a computer readable storage medium storing a computer program which, when executed by a processor, implements each process of the embodiments of the data processing method and can achieve the same technical effects; to avoid repetition, a detailed description is omitted here. The computer readable storage medium may be, for example, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disk.

The embodiment of the specification provides a computer readable storage medium, which is used for acquiring a first word segmentation result corresponding to target data, wherein the target data is data generated in a human-computer interaction process, determining an information value corresponding to each word segmentation based on a plurality of characters included in each word segmentation in the first word segmentation result, wherein the information value corresponding to each word segmentation is used for representing the association strength among the plurality of characters included in each word segmentation, correcting the first word segmentation result based on the information value corresponding to each word and phonetic symbol association relation among the words, obtaining a second word segmentation result, and performing risk detection processing on the target data based on the second word segmentation result. Therefore, the second word segmentation result is obtained by correcting the first word segmentation result through the phonetic symbol association relation and the information value, so that the word segmentation accuracy of the target data can be improved, and the efficiency and the accuracy of risk detection processing of the target data can be improved under the conditions of large data volume and complex data structure of the target data.

The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (e.g., an improvement to a circuit structure such as a diode, transistor, or switch) or an improvement in software (an improvement to a method flow). However, with the development of technology, many improvements to current method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be realized by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD) (e.g., a field programmable gate array (Field Programmable Gate Array, FPGA)) is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a PLD without requiring a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, nowadays, instead of manually manufacturing integrated circuit chips, such programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development, and the original code to be compiled must also be written in a specific programming language, called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logic method flow can be readily obtained by merely performing a little logic programming of the method flow in one of the hardware description languages described above and programming it into an integrated circuit.

The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer readable medium storing computer readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer readable program code, it is entirely possible to logically program the method steps so that the controller achieves the same functionality in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may thus be regarded as a hardware component, and the means included therein for performing various functions may also be regarded as structures within the hardware component. Or even, the means for performing the various functions may be regarded both as software modules implementing the method and as structures within the hardware component.

The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

For convenience of description, the above devices are described as being functionally divided into various units. Of course, when implementing one or more embodiments of the present description, the functionality of the units may be implemented in one or more pieces of software and/or hardware.

It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Moreover, one or more embodiments of the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

Embodiments of the present description are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

One or more embodiments of the present specification may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. One or more embodiments of the present description may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

In this specification, the embodiments are described in a progressive manner; identical and similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiments are substantially similar to the method embodiments, their description is relatively brief, and for relevant details reference may be made to the corresponding parts of the description of the method embodiments.

The foregoing is merely exemplary of the present disclosure and is not intended to limit the disclosure. Various modifications and alterations to this specification will become apparent to those skilled in the art. Any modifications, equivalent substitutions, improvements, or the like, which are within the spirit and principles of the present description, are intended to be included within the scope of the claims of the present description.

Claims

1. A data processing method, comprising:

performing risk detection processing on the target data based on the second word segmentation result;

wherein correcting the first word segmentation result based on the information value corresponding to each word segment and the phonetic symbol association relation between the word segments to obtain a second word segmentation result comprises the following steps:

determining a second word segmentation result based on a first word segmentation and a third word segmentation, wherein the first word segmentation is any word segmentation in the first word segmentation result, the third word segmentation is a word segmentation obtained by combining a plurality of second words under the condition that the information value corresponding to the first word segmentation is larger than the information value corresponding to each of the plurality of second words, and the second word segmentation is a word segmentation with phonetic symbol association relation with the first word segmentation in the first word segmentation result.

2. The method of claim 1, wherein the performing risk detection processing on the target data based on the second word segmentation result includes:

3. The method of claim 1, wherein the correcting the first word segmentation result based on the information value corresponding to each word segmentation and the phonetic symbol association relationship between the words to obtain a second word segmentation result includes:

acquiring a plurality of second word segments having a phonetic symbol association relation with the first word segment in the first word segmentation result;

combining the plurality of second words to obtain the third word when the information value corresponding to the first word is larger than the information value corresponding to each of the plurality of second words;

4. The method according to claim 3, wherein when the information value corresponding to the first word is greater than the information value corresponding to each of the plurality of second words, merging the plurality of second words to obtain a third word, including:

5. The method of claim 4, the method further comprising:

6. The method of claim 4, wherein combining the plurality of second word segments to obtain the third word segment, in the case that the information value corresponding to the first word segment is greater than the information value corresponding to each of the multiple word segment groups, comprises:

7. The method of claim 6, the method further comprising:

8. The method of claim 1, the information value being a pointwise mutual information (PMI) value.

9. A data processing apparatus comprising:

the first acquisition module is used for acquiring a first word segmentation result corresponding to target data, wherein the target data is data generated in a human-computer interaction process;

the risk detection module is used for carrying out risk detection processing on the target data based on the second word segmentation result;

wherein, the result correction module is used for:

10. A data processing apparatus, the data processing apparatus comprising:

a processor; and

a memory arranged to store computer executable instructions that, when executed, cause the processor to:

11. A storage medium for storing computer-executable instructions that when executed implement the following: