CN115422423A - Client portrait determination method and device, electronic equipment and storage medium - Google Patents

Client portrait determination method and device, electronic equipment and storage medium

Info

Publication number
CN115422423A
Authority
CN
China
Prior art keywords
corpus
training
target
corpora
determined
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211063634.2A
Other languages
Chinese (zh)
Inventor
莫群
徐洪军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kerui Jinxin Technology Co ltd
Zhejiang University ZJU
Original Assignee
Beijing Kerui Jinxin Technology Co ltd
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kerui Jinxin Technology Co ltd, Zhejiang University ZJU filed Critical Beijing Kerui Jinxin Technology Co ltd
Priority to CN202211063634.2A priority Critical patent/CN115422423A/en
Publication of CN115422423A publication Critical patent/CN115422423A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/906 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/216 - Parsing using statistical methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method and a device for determining a customer portrait, an electronic device and a storage medium. Pre-training corpora are acquired from a pre-training corpus, the pre-training corpus comprising at least two pre-training corpora; a target scale of the at least two pre-training corpora is determined, and text scaling is performed on the different pre-training corpora of the pre-training corpus according to the target scale; a corpus to be determined is merged with each of the at least two scaled pre-training corpora to obtain at least two target corpora; and the at least two target corpora are input into a unary recognition model, and a client portrait recognition result of the corpus to be determined output by the unary recognition model is acquired, wherein the unary recognition model is used for recognizing single words in the corpora. Compared with the prior art, the embodiments of the disclosure identify the client portrait of the corpus to be determined after merging and processing it with the pre-training corpora, and the unary recognition model can effectively improve the accuracy of client portrait identification.

Description

Client portrait determination method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a method and an apparatus for determining a customer portrait, an electronic device, and a storage medium.
Background
The popularity of the internet has led to an exponential increase in the amount of textual information we can access, much of it from unknown sources. A personal language style is the habitual mode of expression a person forms in a given language. To attribute a text whose client identity is unknown, the text can be compared against the language-style characteristics of known clients and assigned to the closest match. However, the accuracy of existing customer portrait identification still needs to be improved.
Disclosure of Invention
The present disclosure provides a method, an apparatus, an electronic device and a storage medium for determining a customer portrait, mainly aiming to solve the problem of low accuracy in customer portrait identification.
According to a first aspect of the present disclosure, a method for determining a client portrait is provided, comprising:
acquiring pre-training corpora from a pre-training corpus, wherein the pre-training corpus comprises at least two pre-training corpora;
determining a target scale of the at least two pre-training corpora, and performing text scaling on the different pre-training corpora of the pre-training corpus according to the target scale;
merging a corpus to be determined with each of the at least two scaled pre-training corpora to obtain at least two target corpora;
and inputting the at least two target corpora into a unary recognition model, and acquiring a client portrait recognition result of the corpus to be determined output by the unary recognition model, wherein the unary recognition model is used for recognizing single words in the corpora.
Optionally, the performing text scaling on the different pre-training corpora of the pre-training corpus according to the target scale includes:
determining respective scaling coefficients of the at least two pre-training corpora according to the target scale;
and scaling the at least two pre-training corpora according to their respective scaling coefficients.
Optionally, the merging the corpus to be determined with the at least two scaled pre-training corpora to obtain at least two target corpora includes:
calling a preset superposition algorithm, and superposing the corpus to be determined with each of the at least two scaled pre-training corpora;
and performing sentence smoothing on the results of the superposition to obtain the at least two target corpora, wherein the probability of occurrence of each word in the at least two target corpora is not 0.
Optionally, the acquiring the client portrait recognition result of the corpus to be determined output by the unary recognition model includes:
determining, in the unary recognition model, the word frequency of each word of the corpus to be determined in each target corpus;
calculating, according to the word frequencies, a probability value of each sentence of the corpus to be determined in each target corpus;
and determining the client portrait recognition result according to the probability values of the sentences.
Optionally, the determining, in the unary recognition model, a word frequency of each word in the corpus to be determined in each target corpus is implemented by the following formula:
p(w_i) = count(w_i) / count(words)
where count(w_i) denotes the number of occurrences of the word w_i in the target corpus, and count(words) denotes the total number of occurrences of all words in the target corpus.
Optionally, the calculating, according to the word frequencies, a probability value of each sentence in each target corpus is implemented by the following formula:
p(s) = p(w_1) p(w_2) p(w_3) ... p(w_n)
or, equivalently,
log p(s) = log p(w_1) + log p(w_2) + log p(w_3) + ... + log p(w_n)
where w_i denotes a word appearing in the target corpus, and s denotes a sentence in the corpus to be determined.
According to a second aspect of the present disclosure, there is provided a client portrait determination apparatus, comprising:
an acquisition unit, configured to acquire pre-training corpora from a pre-training corpus, wherein the pre-training corpus comprises at least two pre-training corpora;
a scaling unit, configured to determine a target scale of the at least two pre-training corpora and perform text scaling on the different pre-training corpora of the pre-training corpus according to the target scale;
a merging unit, configured to merge a corpus to be determined with each of the at least two scaled pre-training corpora to obtain at least two target corpora;
and a determining unit, configured to input the at least two target corpora into a unary recognition model and acquire a client portrait recognition result of the corpus to be determined output by the unary recognition model, wherein the unary recognition model is used for recognizing single words in the corpora.
Optionally, the scaling unit includes:
a first determining module, configured to determine respective scaling coefficients of the at least two pre-training corpora according to the target scale;
and a scaling module, configured to scale the at least two pre-training corpora according to their respective scaling coefficients.
Optionally, the merging unit includes:
a calling module, configured to call a preset superposition algorithm and superpose the corpus to be determined with each of the at least two scaled pre-training corpora;
and a processing module, configured to perform sentence smoothing on the results of the superposition to obtain the at least two target corpora, wherein the probability of occurrence of each word in the at least two target corpora is not 0.
Optionally, the determining unit includes:
a second determining module, configured to determine, in the unary recognition model, the word frequency of each word of the corpus to be determined in each target corpus;
a calculation module, configured to calculate, according to the word frequencies, a probability value of each sentence of the corpus to be determined in each target corpus;
and a third determining module, configured to determine the client portrait recognition result according to the probability values of the sentences.
Optionally, the second determining module is implemented by the following formula:
p(w_i) = count(w_i) / count(words)
where count(w_i) denotes the number of occurrences of the word w_i in the target corpus, and count(words) denotes the total number of occurrences of all words in the target corpus.
Optionally, the calculating module is implemented by the following formula:
p(s) = p(w_1) p(w_2) p(w_3) ... p(w_n)
or, equivalently,
log p(s) = log p(w_1) + log p(w_2) + log p(w_3) + ... + log p(w_n)
where w_i denotes a word appearing in the target corpus, and s denotes a sentence in the corpus to be determined.
According to a third aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of the aforementioned first aspect.
According to a fifth aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method as set forth in the preceding first aspect.
The invention provides a method and a device for determining a customer portrait, an electronic device and a storage medium. Pre-training corpora are acquired from a pre-training corpus, the pre-training corpus comprising at least two pre-training corpora; a target scale of the at least two pre-training corpora is determined, and text scaling is performed on the different pre-training corpora of the pre-training corpus according to the target scale; a corpus to be determined is merged with each of the at least two scaled pre-training corpora to obtain at least two target corpora; and the at least two target corpora are input into a unary recognition model, and a client portrait recognition result of the corpus to be determined output by the unary recognition model is acquired, wherein the unary recognition model is used for recognizing single words in the corpora. Compared with the prior art, the embodiments of the disclosure identify the client portrait of the corpus to be determined after merging and processing it with the pre-training corpora, and the unary recognition model can effectively improve the accuracy of client portrait identification.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present application, nor are they intended to limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic flowchart of a client portrait determination method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic flowchart of text scaling provided by an embodiment of the present disclosure;
FIG. 3 is a schematic flowchart of a unary recognition model recognition method provided by an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a client portrait determination apparatus provided by an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of another client portrait determination apparatus provided by an embodiment of the present disclosure;
FIG. 6 is a schematic block diagram of an example electronic device 500 provided by an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
A client portrait determination method, apparatus, electronic device, and storage medium of the embodiments of the present disclosure are described below with reference to the drawings.
FIG. 1 is a flowchart illustrating a method for determining a client portrait according to an embodiment of the disclosure.
As shown in fig. 1, the method comprises the following steps:
Step 101, acquiring pre-training corpora from a pre-training corpus, wherein the pre-training corpus comprises at least two pre-training corpora.
In some embodiments, the following implementation may be employed, but is not limited to: the pre-training corpora are texts pre-processed with the Natural Language Toolkit (NLTK). NLTK is used to tokenize all of the collected texts and to count their words, which determines the scale of each text; each processed text yields a corresponding pre-training corpus, and each pre-training corpus carries unique client portrait identification information. In the embodiment of the disclosure, NLTK segments the texts into the single words used by the unary model, and the pre-processed pre-training corpora are stored in the pre-training corpus, which thus contains a large amount of pre-processed material. The texts processed by NLTK form the at least two pre-training corpora. It should be noted that NLTK is used here only as an example and does not limit which natural language processing tool may be used in the present disclosure.
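As a minimal illustrative sketch of this preprocessing step (assuming one plain-text file per known client; the file paths, portrait labels, and function name are hypothetical, and NLTK's punkt tokenizer data must be downloaded beforehand):

```python
# Minimal preprocessing sketch for step 101 (illustrative only).
# Assumes NLTK is installed and nltk.download('punkt') has been run;
# the file paths and portrait labels below are hypothetical.
from collections import Counter
from nltk.tokenize import word_tokenize

def build_pretraining_corpus(text_files):
    """Tokenize each known client's text and record its unigram counts and scale."""
    corpus = {}
    for portrait_id, path in text_files.items():
        with open(path, encoding="utf-8") as f:
            tokens = word_tokenize(f.read().lower())
        counts = Counter(tokens)
        corpus[portrait_id] = {
            "counts": counts,               # word -> number of occurrences
            "scale": sum(counts.values()),  # total number of word tokens
        }
    return corpus

# Hypothetical usage: two known clients, each tagged with portrait identification info.
pretraining_corpus = build_pretraining_corpus({
    "client_A": "client_a.txt",
    "client_B": "client_b.txt",
})
```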
Step 102, determining a target scale of the at least two pre-training corpora, and performing text scaling on the different pre-training corpora of the pre-training corpus according to the target scale.
Inconsistent scales of the pre-training corpora would cause the target corpora, each composed of a pre-training corpus and the corpus to be determined, to differ in size, so that the word frequencies of the words of the corpus to be determined would show obvious deviations across the target corpora. To account for this imbalance in corpus size, text scaling must be performed on the pre-training corpora so that every pre-training corpus in the pre-training corpus has an equal scale, that is, an equal total number of words in its vocabulary.
Step 103, merging the corpus to be determined with each of the at least two scaled pre-training corpora to obtain at least two target corpora.
Incomplete information leads to sparse data: because the corpus to be determined differs from the pre-training corpora, and the pre-training corpora are limited, it cannot be guaranteed that every word of the corpus to be determined appears in the pre-training corpora. If the probability of some word is zero, then no matter how frequent the other words are, the product of the final probabilities is zero. A non-zero probability value must therefore be assigned to every word that may occur in order to avoid this. To solve the zero-probability problem caused by data sparsity, after text scaling is performed on the at least two pre-training corpora, the corpus to be determined is merged with each of the at least two scaled pre-training corpora to obtain at least two target corpora, which are then processed with a preset algorithm.
Step 104, inputting the at least two target corpora into a unary recognition model, and obtaining a client portrait recognition result of the corpus to be determined output by the unary recognition model, wherein the unary recognition model is used for recognizing single words in the corpora.
The processed target corpora are input into the unary recognition model of the embodiment of the disclosure. The unary recognition model computes word frequencies for the words of the corpus to be determined and from them obtains the probability values used to decide the client portrait recognition result.
The invention provides a method for determining a customer portrait: pre-training corpora are acquired from a pre-training corpus, the pre-training corpus comprising at least two pre-training corpora; a target scale of the at least two pre-training corpora is determined, and text scaling is performed on the different pre-training corpora of the pre-training corpus according to the target scale; the corpus to be determined is merged with each of the at least two scaled pre-training corpora to obtain at least two target corpora; and the at least two target corpora are input into the unary recognition model, and the client portrait recognition result of the corpus to be determined output by the unary recognition model is acquired. Compared with the prior art, the embodiment of the disclosure identifies the client portrait of the corpus to be determined after merging and processing it under the unary model, which improves the success rate of client portrait recognition.
To understand the text scaling process more intuitively, please refer to fig. 2, which is a schematic flowchart of text scaling provided by an embodiment of the present disclosure. The performing text scaling on the different pre-training corpora of the pre-training corpus according to the target scale includes:
Step 201, determining the respective scaling coefficients of the at least two pre-training corpora according to the target scale.
The target scale is determined as the largest scale among all the pre-training corpora. A scaling coefficient is then determined for each pre-training corpus as the ratio of the target scale to the scale of that pre-training corpus. The count of each word in a pre-training corpus is multiplied by the scaling coefficient to obtain its count in the scaled pre-training corpus.
Step 202, scaling the at least two pre-training corpora according to their respective scaling coefficients.
Each pre-training corpus is scaled according to its scaling coefficient to obtain the scaled pre-training corpus. The count of each word in the scaled pre-training corpus is the product of its count in the original pre-training corpus and the scaling coefficient.
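A minimal sketch of steps 201 and 202, under the assumptions that the target scale is taken as the largest scale among the pre-training corpora and that scaled counts are allowed to remain fractional (the data layout follows the hypothetical preprocessing sketch above):

```python
# Sketch of steps 201-202: scale every pre-training corpus to a common target
# scale, taken here as the largest scale among the pre-training corpora.
def scale_corpora(corpus):
    target_scale = max(entry["scale"] for entry in corpus.values())
    scaled = {}
    for portrait_id, entry in corpus.items():
        coefficient = target_scale / entry["scale"]   # scaling coefficient
        scaled_counts = {word: count * coefficient
                         for word, count in entry["counts"].items()}
        scaled[portrait_id] = {"counts": scaled_counts, "scale": target_scale}
    return scaled

scaled_corpus = scale_corpora(pretraining_corpus)
```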
Further, in a possible implementation manner of this embodiment, the merging the corpus to be determined with the at least two scaled pre-training corpora to obtain at least two target corpora includes:
calling a preset superposition algorithm, and superposing the corpus to be determined with each of the at least two scaled pre-training corpora;
and performing sentence smoothing on the results of the superposition to obtain the at least two target corpora, wherein the probability of occurrence of each word in the at least two target corpora is not 0.
When identifying the client portrait, a non-zero probability value is required for every word in a sentence: if any word has zero probability, the result of the whole calculation is zero, so all words that may occur must be assigned a non-zero probability value. In the embodiment of the disclosure, a preset superposition algorithm adds the vocabulary of the corpus to be determined to the vocabulary of each scaled pre-training corpus, so that the scales of the resulting target corpora are consistent; each target corpus keeps the client portrait identification information of its pre-training corpus. Each target corpus is then processed with additive smoothing. It should be noted that additive smoothing is used here for the sentence smoothing of the superposition results only as an example, and this does not limit which smoothing algorithm may be used in the present disclosure.
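One way the superposition and additive smoothing could be realized is sketched below; add-one (k = 1) smoothing is used purely as an example of an additive scheme, and superposing the full word counts of the corpus to be determined (rather than only its vocabulary entries) is an assumption:

```python
from collections import Counter
from nltk.tokenize import word_tokenize

# Sketch of step 103: superpose the corpus to be determined onto each scaled
# pre-training corpus, then apply additive smoothing so that every word in the
# resulting target corpus has a non-zero probability. k = 1 is an assumption.
def build_target_corpora(scaled_corpus, undetermined_text, k=1.0):
    undetermined_counts = Counter(word_tokenize(undetermined_text.lower()))
    target_corpora = {}
    for portrait_id, entry in scaled_corpus.items():
        merged = dict(entry["counts"])
        for word, count in undetermined_counts.items():
            merged[word] = merged.get(word, 0) + count          # superposition
        total = sum(merged.values()) + k * len(merged)
        # Additive smoothing: every word of the merged vocabulary gets probability mass.
        probs = {word: (count + k) / total for word, count in merged.items()}
        target_corpora[portrait_id] = probs
    return target_corpora
```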
To better explain how the unary recognition model recognizes the customer portrait, please refer to fig. 3, which is a schematic flowchart of the unary recognition model's recognition method according to an embodiment of the disclosure. The obtaining of the client portrait recognition result of the corpus to be determined output by the unary recognition model comprises:
Step 301, determining, in the unary recognition model, the word frequency of each word of the corpus to be determined in each target corpus.
The word frequency of each word of the corpus to be determined in each target corpus is determined with the unary recognition model. The probability of each word's occurrence is approximated by its statistical word frequency.
Step 302, calculating, according to the word frequencies, a probability value of each sentence of the corpus to be determined in each target corpus.
Assuming that the occurrence of a word depends only on the word itself and not on the words that occurred before it, the probability of each word's occurrence is approximated by its word frequency. In each target corpus, the probability value of each sentence of the corpus to be determined can then be calculated from the probability values of its words: the product of the occurrence probabilities of the words in a sentence is the probability value of that sentence. The unary model is used to calculate the probability value of each sentence of the corpus to be determined in each target corpus.
Step 303, determining the client portrait recognition result according to the probability value of each sentence.
The probability value of each sentence of the corpus to be determined is calculated with the unary model, and the probability values obtained in the different target corpora are compared; the client portrait identification information of the target corpus with the maximum probability value is obtained, and the corpus to be determined is considered to come from the client corresponding to that target corpus.
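A sketch of steps 301 to 303 under the same assumptions as above: every sentence of the corpus to be determined is scored against each target corpus, and the portrait identification of the highest-scoring target corpus is returned (log probabilities are used, for the reason discussed after the formulas below):

```python
import math
from nltk.tokenize import sent_tokenize, word_tokenize

# Sketch of steps 301-303: score each sentence of the corpus to be determined
# under every target corpus and attribute the corpus to the client whose
# target corpus yields the highest total log probability.
def identify_client(undetermined_text, target_corpora):
    sentences = [word_tokenize(s.lower()) for s in sent_tokenize(undetermined_text)]
    best_id, best_score = None, float("-inf")
    for portrait_id, probs in target_corpora.items():
        score = 0.0
        for sentence in sentences:
            # log p(s) = sum of log p(w_i) under the unary (unigram) model.
            # Smoothing guarantees non-zero probabilities for merged-vocabulary
            # words; the tiny floor for anything else is an extra assumption.
            score += sum(math.log(probs.get(word, 1e-12)) for word in sentence)
        if score > best_score:
            best_id, best_score = portrait_id, score
    return best_id
```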
Further, in a possible implementation manner of this embodiment, the determining, in the unary recognition model, a word frequency of each word in the corpus to be determined is implemented by the following formula:
p(w_i) = count(w_i) / count(words)
where count(w_i) denotes the number of occurrences of the word w_i in the target corpus, and count(words) denotes the total number of occurrences of all words in the target corpus.
In the disclosed embodiments, the unary recognition model is used to resolve the attribution of the customer portrait. The probability of each word's occurrence is approximated by its word frequency.
Further, in a possible implementation manner of this embodiment, the calculating a probability value of each sentence in the corpus to be determined according to the word frequency is implemented by the following formula:
p(s) = p(w_1) p(w_2) p(w_3) ... p(w_n)
or, equivalently,
log p(s) = log p(w_1) + log p(w_2) + log p(w_3) + ... + log p(w_n)
where w_i denotes a word appearing in the target corpus, and s denotes a sentence in the corpus to be determined.
Assuming that the occurrence of a word depends only on the word itself and not on the words that occurred before it, the probability of occurrence of a sentence s = w_1 w_2 w_3 ... w_n is the product of the occurrence probabilities of its words. In a concrete calculation, if the test sentence is long, this product becomes extremely small and can even underflow the numeric representation. It is therefore preferable to work with the logarithm of the product, i.e. the sum of the log probabilities.
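A small numerical illustration of this point (the per-word probability 0.001 and the sentence length of 150 words are arbitrary):

```python
import math

# Why the logarithmic form is used: the raw product of many small word
# probabilities underflows to 0.0 in double precision, while the equivalent
# sum of logarithms stays finite and comparable.
word_probs = [1e-3] * 150       # 150 words, each with probability 0.001

product = 1.0
for p in word_probs:
    product *= p
print(product)                   # 0.0 -- the product has underflowed

log_sum = sum(math.log(p) for p in word_probs)
print(log_sum)                   # about -1036.2, easily representable
```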
Corresponding to the method for determining a client portrait, the invention also provides a device for determining a client portrait. Since the device embodiment of the present invention corresponds to the method embodiment described above, details that are not disclosed in the device embodiment may refer to the method embodiment described above, and are not described again in the present invention.
Fig. 4 is a schematic structural diagram of a client portrait determination apparatus according to an embodiment of the present disclosure, as shown in fig. 4, including: an acquisition unit 41, a scaling unit 42, a merging unit 43 and a determination unit 44.
An acquisition unit 41, configured to acquire pre-training corpora from a pre-training corpus, where the pre-training corpus includes at least two pre-training corpora;
a scaling unit 42, configured to determine target scales of the at least two pre-training corpora, and perform text scaling on different pre-training corpora of the pre-training corpus according to the target scales;
a merging unit 43, configured to merge the corpus to be determined with each of the at least two scaled pre-training corpora to obtain at least two target corpora;
and a determining unit 44, configured to input the at least two target corpora into a unary recognition model and acquire a client portrait recognition result of the corpus to be determined output by the unary recognition model, where the unary recognition model is used to recognize single words in the corpora.
Further, in a possible implementation manner of this embodiment, as shown in fig. 5, the scaling unit 42 includes:
a first determining module 421, configured to determine, according to the target scale, the respective scaling coefficients of the at least two pre-training corpora;
and a scaling module 422, configured to scale the at least two pre-training corpora according to their respective scaling coefficients.
Further, in a possible implementation manner of this embodiment, as shown in fig. 5, the merging unit 43 includes:
a calling module 431, configured to call a preset superposition algorithm and superpose the corpus to be determined with each of the at least two scaled pre-training corpora;
and a processing module 432, configured to perform sentence smoothing on the results of the superposition to obtain the at least two target corpora, where the probability of occurrence of each word in the at least two target corpora is not 0.
Further, in a possible implementation manner of this embodiment, as shown in fig. 5, the determining unit 44 includes:
a second determining module 441, configured to determine, in the unary recognition model, the word frequency of each word of the corpus to be determined in each target corpus;
a calculating module 442, configured to calculate, according to the word frequencies, a probability value of each sentence of the corpus to be determined in each target corpus;
and a third determining module 443, configured to determine the client portrait recognition result according to the probability values of the sentences.
Further, in a possible implementation manner of this embodiment, the second determining module 441 is implemented by the following formula:
p(w_i) = count(w_i) / count(words)
where count(w_i) denotes the number of occurrences of the word w_i in the target corpus, and count(words) denotes the total number of occurrences of all words in the target corpus.
Further, in a possible implementation manner of this embodiment, the calculating module 442 is implemented by the following formula:
p(s) = p(w_1) p(w_2) p(w_3) ... p(w_n)
or, equivalently,
log p(s) = log p(w_1) + log p(w_2) + log p(w_3) + ... + log p(w_n)
where w_i denotes a word appearing in the target corpus, and s denotes a sentence in the corpus to be determined.
It should be noted that the foregoing explanation of the method embodiment is also applicable to the apparatus of the present embodiment, and the principle is the same, and the present embodiment is not limited thereto.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 6 illustrates a schematic block diagram of an example electronic device 500 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the device 500 includes a computing unit 501, which can perform various appropriate actions and processes in accordance with a computer program stored in a ROM (Read-Only Memory) 502 or a computer program loaded from a storage unit 508 into a RAM (Random Access Memory) 503. In the RAM 503, various programs and data required for the operation of the device 500 can also be stored. The calculation unit 501, the ROM 502, and the RAM 503 are connected to each other by a bus 504. An I/O (Input/Output) interface 505 is also connected to the bus 504.
A number of components in the device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, or the like; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508, such as a magnetic disk, optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 501 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), various dedicated AI (Artificial Intelligence) computing chips, various computing units running machine learning model algorithms, a DSP (Digital Signal Processor), and any suitable processor, controller, microcontroller, and the like. The computing unit 501 performs the various methods and processes described above, such as the client portrait determination method. For example, in some embodiments, the client portrait determination method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the method described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured in any other suitable manner (e.g., by way of firmware) to perform the aforementioned client portrait determination method.
Various implementations of the systems and techniques described here above may be realized in digital electronic circuitry, integrated circuitry, FPGAs (Field Programmable Gate arrays), ASICs (Application-Specific Integrated circuits), ASSPs (Application Specific Standard products), SOCs (System On Chip, system On a Chip), CPLDs (Complex Programmable Logic devices), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/acts specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a RAM, a ROM, an EPROM (Erasable Programmable Read-Only Memory) or flash memory, an optical fiber, a CD-ROM (Compact Disc Read-Only Memory), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (Cathode Ray Tube) or an LCD (Liquid Crystal Display) monitor) for displaying information to the user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: LAN (Local Area Network), WAN (Wide Area Network), internet, and blockchain Network.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system that addresses the drawbacks of high management difficulty and weak service scalability in traditional physical hosts and VPS services ("Virtual Private Server", or simply "VPS"). The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be noted that artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, planning, etc.), and it spans both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technology, and the like.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, which is not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (10)

1. A method for determining a client portrait, comprising:
acquiring pre-training corpora from a pre-training corpus, wherein the pre-training corpus comprises at least two pre-training corpora;
determining a target scale of the at least two pre-training corpora, and performing text scaling on the different pre-training corpora of the pre-training corpus according to the target scale;
merging a corpus to be determined with each of the at least two scaled pre-training corpora to obtain at least two target corpora;
and inputting the at least two target corpora into a unary recognition model, and acquiring a client portrait recognition result of the corpus to be determined output by the unary recognition model, wherein the unary recognition model is used for recognizing single words in the corpora.
2. The method according to claim 1, wherein the performing text scaling on the different pre-training corpora of the pre-training corpus according to the target scale comprises:
determining respective scaling coefficients of the at least two pre-training corpora according to the target scale;
and scaling the at least two pre-training corpora according to their respective scaling coefficients.
3. The method according to claim 1, wherein the merging the corpus to be determined with the at least two scaled pre-training corpora to obtain at least two target corpora comprises:
calling a preset superposition algorithm, and superposing the corpus to be determined with each of the at least two scaled pre-training corpora;
and performing sentence smoothing on the results of the superposition to obtain the at least two target corpora, wherein the probability of occurrence of each word in the at least two target corpora is not 0.
4. The method according to any one of claims 1 to 3, wherein the acquiring the client portrait recognition result of the corpus to be determined output by the unary recognition model comprises:
determining, in the unary recognition model, the word frequency of each word of the corpus to be determined in each target corpus;
calculating, according to the word frequencies, a probability value of each sentence of the corpus to be determined in each target corpus;
determining the client portrait recognition result according to the probability values of the sentences;
wherein the word frequency p(w_i) of each word is implemented by the following formula:
p(w_i) = count(w_i) / count(words)
and the probability value of each sentence is implemented by the following formula:
p(s) = p(w_1) p(w_2) p(w_3) ... p(w_n)
or, equivalently,
log p(s) = log p(w_1) + log p(w_2) + log p(w_3) + ... + log p(w_n)
wherein w_i denotes a word appearing in the target corpus, s denotes a sentence in the corpus to be determined, count(w_i) denotes the number of occurrences of the word w_i in the target corpus, and count(words) denotes the total number of occurrences of all words in the target corpus.
5. A client portrait determination apparatus, comprising:
an acquisition unit, configured to acquire pre-training corpora from a pre-training corpus, wherein the pre-training corpus comprises at least two pre-training corpora;
a scaling unit, configured to determine a target scale of the at least two pre-training corpora and perform text scaling on the different pre-training corpora of the pre-training corpus according to the target scale;
a merging unit, configured to merge a corpus to be determined with each of the at least two scaled pre-training corpora to obtain at least two target corpora;
and a determining unit, configured to input the at least two target corpora into a unary recognition model and acquire a client portrait recognition result of the corpus to be determined output by the unary recognition model, wherein the unary recognition model is used for recognizing single words in the corpora.
6. The apparatus of claim 5, wherein the scaling unit comprises:
a first determining module, configured to determine respective scaling coefficients of the at least two pre-training corpora according to the target scale;
and a scaling module, configured to scale the at least two pre-training corpora according to their respective scaling coefficients.
7. The apparatus of claim 5, wherein the merging unit comprises:
a calling module, configured to call a preset superposition algorithm and superpose the corpus to be determined with each of the at least two scaled pre-training corpora;
a processing module, configured to perform sentence smoothing on the results of the superposition to obtain the at least two target corpora, wherein the probability of occurrence of each word in the at least two target corpora is not 0;
and the determining unit comprises:
a second determining module, configured to determine, in the unary recognition model, the word frequency of each word of the corpus to be determined in each target corpus;
a calculation module, configured to calculate, according to the word frequencies, a probability value of each sentence of the corpus to be determined in each target corpus;
and a third determining module, configured to determine the client portrait recognition result according to the probability values of the sentences.
8. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-4.
9. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-4.
10. A computer program product, characterized in that it comprises a computer program which, when being executed by a processor, carries out the method according to any one of claims 1-4.
CN202211063634.2A 2022-09-01 2022-09-01 Client portrait determination method and device, electronic equipment and storage medium Pending CN115422423A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211063634.2A CN115422423A (en) 2022-09-01 2022-09-01 Client portrait determination method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211063634.2A CN115422423A (en) 2022-09-01 2022-09-01 Client portrait determination method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115422423A true CN115422423A (en) 2022-12-02

Family

ID=84199710

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211063634.2A Pending CN115422423A (en) 2022-09-01 2022-09-01 Client portrait determination method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115422423A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination