CN110427330B - Code analysis method and related device - Google Patents


Info

Publication number
CN110427330B
Authority
CN
China
Prior art keywords
analyzed
vector
output
error
code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910747791.7A
Other languages
Chinese (zh)
Other versions
CN110427330A (en)
Inventor
赵旸
刘思凡
邱旻峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910747791.7A
Publication of CN110427330A
Application granted
Publication of CN110427330B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/36: Preventing errors by testing or debugging software
    • G06F11/362: Software debugging
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Quality & Reliability (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the application provide a code analysis method and a related device. After the word vectors to be analyzed of a code text to be analyzed are obtained, the starting probability and the termination probability of each output vector in the code text to be analyzed are calculated by comparing the word vectors to be analyzed with the error word vector. The starting position and the termination position of a target code segment in the code text to be analyzed are then determined according to the starting probabilities and the termination probabilities, the target code segment being the error code segment identified by the analysis. Because the target code segment is determined by calculating vector probabilities, complex error types can be analyzed, which solves the technical problem that complex errors in code currently cannot be checked.

Description

Code analysis method and related device
Technical Field
The application relates to the technical field of the Internet, and in particular to a code analysis method and a related device.
Background
With the development of the Internet, a wide variety of computer and mobile phone software has appeared on the market. Software developers develop this software by writing code, which is the most widespread and common work in the computer industry. How to perform code analysis tasks such as error detection, code generation, and code completion on the code written by software developers has become an industry hotspot.
Currently, a server can check for specific errors that may occur in code, relating to code specifications, code security, code repetition rate and the like, through rules set by staff. For example, the server can detect that brackets are not used in the code as the specification requires.
However, this approach can only check for simple code-specification errors and cannot check for more complex errors in the code.
Disclosure of Invention
The embodiment of the application provides a code analysis method and a related device, which are used for solving the technical problem that complex errors in codes cannot be checked at present.
In view of this, a first aspect of the present application provides a method for code analysis, including:
acquiring N word vectors to be analyzed corresponding to code texts to be analyzed and error word vectors corresponding to error code texts, wherein the error code texts represent code texts matched with the code texts to be analyzed, and N is an integer greater than 1;
obtaining, through a neural network model, N output vectors corresponding to combined word vectors, wherein the combined word vectors are generated according to the word vectors to be analyzed and the error word vector;
calculating, according to the N output vectors and the error word vector, the starting probability and the termination probability of each of the N output vectors in the code text to be analyzed;
determining a target code segment according to the starting probability and the termination probability of each output vector in the code text to be analyzed;
and generating a code analysis result of the code text to be analyzed according to the target code segment.
A second aspect of an embodiment of the present application provides an apparatus for code analysis, including:
the acquisition unit is used for acquiring N word vectors to be analyzed corresponding to the code text to be analyzed and error word vectors corresponding to the error code text, wherein the error code text represents the code text matched with the code text to be analyzed, and N is an integer greater than 1;
the processing unit is used for obtaining N output vectors corresponding to the combined word vectors through a neural network model, wherein the combined word vectors are generated according to the word vectors to be analyzed and the error word vectors;
the processing unit is further used for calculating, according to the N output vectors and the error word vector, the starting probability and the termination probability of each of the N output vectors in the code text to be analyzed;
the processing unit is further used for determining a target code segment according to the starting probability and the termination probability of each output vector in the code text to be analyzed;
and the generating unit is used for generating a code analysis result of the code text to be analyzed according to the target code segment.
In one possible design, in an implementation manner of the second aspect of the embodiment of the present application, the processing unit is further configured to:
calculating the starting probability of each of the N output vectors in the code text to be analyzed according to the N output vectors and the error word vector;
calculating the termination probability of each of the N output vectors in the code text to be analyzed according to the N output vectors and the error word vector;
wherein the calculating of the starting probability of each of the N output vectors in the code text to be analyzed according to the N output vectors and the error word vector includes:
determining a starting weight score corresponding to the i-th output vector according to the i-th output vector, the set starting weight and the error word vector, wherein i is an integer greater than or equal to 1 and less than or equal to N;
determining a total starting weight score according to the N output vectors, the set starting weight and the error word vector;
determining the starting probability according to the starting weight score corresponding to the i-th output vector and the total starting weight score;
wherein the calculating of the termination probability of each of the N output vectors in the code text to be analyzed according to the N output vectors and the error word vector includes:
determining a termination weight score corresponding to the j-th output vector according to the j-th output vector, the set termination weight and the error word vector, wherein j is an integer greater than or equal to 1 and less than or equal to N;
determining a total termination weight score according to the N output vectors, the set termination weight and the error word vector;
and determining the termination probability according to the termination weight score corresponding to the j-th output vector and the total termination weight score.
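The score-then-normalize computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the text does not give the scoring formula at this point, so the bilinear form exp(p · W · q) and every name here (`bilinear_score`, `position_probabilities`, the toy weights and vectors) are hypothetical.

```python
import math

def bilinear_score(output_vec, weight, error_vec):
    """Unnormalised weight score exp(p . W . q) for one output vector (assumed form)."""
    # W q: project the error word vector with the set weight matrix.
    projected = [sum(w * q for w, q in zip(row, error_vec)) for row in weight]
    return math.exp(sum(p * v for p, v in zip(output_vec, projected)))

def position_probabilities(output_vecs, weight, error_vec):
    """Divide each weight score by the total score so the N values sum to 1."""
    scores = [bilinear_score(p, weight, error_vec) for p in output_vecs]
    total = sum(scores)
    return [s / total for s in scores]

# Toy data: N = 3 output vectors of dimension 2 and one error word vector.
outputs = [[0.1, 0.2], [0.9, 0.4], [0.3, 0.8]]
error_vector = [1.0, 0.5]
start_weight = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical "set starting weight"
end_weight = [[0.0, 1.0], [1.0, 0.0]]    # hypothetical "set termination weight"

start_probs = position_probabilities(outputs, start_weight, error_vector)
end_probs = position_probabilities(outputs, end_weight, error_vector)
```

Using separate weights for the start and the end lets the same output vectors yield two different distributions over positions, which is what allows a start position and an end position to be chosen independently.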
In one possible design, in an implementation manner of the second aspect of the embodiment of the present application, the processing unit is further configured to:
determining the starting position of the target code segment according to the starting probability of each output vector in the code text to be analyzed;
determining the termination position of the target code segment according to the termination probability of each output vector in the code text to be analyzed;
determining the target code segment according to the starting position of the target code segment and the ending position of the target code segment;
wherein the determining the starting position of the target code segment according to the starting probability of each output vector in the code text to be analyzed comprises:
acquiring the output vector with the highest starting probability;
determining the starting position of the target code segment according to the output vector with the highest starting probability and the mapping relation between the output vector and the code text to be analyzed;
wherein the determining the termination position of the target code segment according to the termination probability of each output vector in the code text to be analyzed comprises:
acquiring the output vector with the highest termination probability;
determining the end position of the target code segment according to the output vector with the highest termination probability and the mapping relation between the output vector and the code text to be analyzed;
and determining the target code segment according to the starting position and the ending position.
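As a rough sketch of the position lookup described above, the function below picks the output vector with the highest probability for each boundary and maps it back to character offsets. The mapping relation is assumed here to be a list of (start, end) character spans, one per output vector; that representation, the whitespace tokenization, and the probability values are illustrative assumptions only.

```python
import re

def locate_target_segment(start_probs, end_probs, tag_spans, code_text):
    """Return (start_pos, end_pos, segment) for the most probable boundaries."""
    # Index of the output vector with the highest probability for each boundary.
    start_idx = max(range(len(start_probs)), key=start_probs.__getitem__)
    end_idx = max(range(len(end_probs)), key=end_probs.__getitem__)
    # Map the vector indices back to character offsets in the code text.
    start_pos = tag_spans[start_idx][0]
    end_pos = tag_spans[end_idx][1]
    return start_pos, end_pos, code_text[start_pos:end_pos]

code_text = "int a = b / 0 ;"
# Hypothetical mapping relation: one character span per whitespace-separated tag.
tag_spans = [m.span() for m in re.finditer(r"\S+", code_text)]
start_probs = [0.05, 0.05, 0.10, 0.60, 0.10, 0.05, 0.05]  # made-up values
end_probs = [0.05, 0.05, 0.05, 0.10, 0.15, 0.55, 0.05]    # made-up values

start_pos, end_pos, segment = locate_target_segment(
    start_probs, end_probs, tag_spans, code_text)
print(segment)
```

Because the two maxima are taken independently, as in the text, nothing in this sketch forces the end position to follow the start position; a practical implementation would presumably need to handle that case.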
In one possible design, in an implementation manner of the second aspect of the embodiment of the present application, the processing unit is further configured to:
determining a combined word vector corresponding to the word vector to be analyzed according to the word vector to be analyzed and the error word vector, wherein the combined word vector comprises an attention mechanism vector and the word vector to be analyzed, the attention mechanism vector is obtained by weighting according to attention scores of the error word vector and the word vector to be analyzed, and the attention scores are used for representing the correlation degree of the error word vector and the word vector to be analyzed;
and acquiring the N output vectors corresponding to the combined word vector through the neural network model.
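One way to picture the attention mechanism vector is sketched below. It assumes that the error code text contributes one error word vector per tag, that the attention score is a dot product normalized with softmax, and that the combined word vector is a plain concatenation; none of these choices is confirmed by the text, and the function names are hypothetical.

```python
import math

def attention_vector(word_vec, error_vecs):
    """Weight the error word vectors by their relevance to one word vector."""
    # Attention scores: dot product between the word vector and each error vector.
    scores = [sum(a * b for a, b in zip(word_vec, q)) for q in error_vecs]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(error_vecs[0])
    # Weighted sum of the error word vectors.
    return [sum(w * q[k] for w, q in zip(weights, error_vecs)) for k in range(dim)]

def combined_word_vector(word_vec, error_vecs):
    """Concatenate the word vector to be analyzed with its attention vector."""
    return list(word_vec) + attention_vector(word_vec, error_vecs)

word_vec = [1.0, 0.0]
error_vecs = [[1.0, 0.0], [0.0, 1.0]]
combined = combined_word_vector(word_vec, error_vecs)
```

In this toy case the first error word vector is the more similar one, so it receives the larger attention weight and dominates the attention part of the combined vector.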
In one possible design, in an implementation manner of the second aspect of the embodiment of the present application, the processing unit is further configured to:
acquiring the code text to be analyzed;
converting the code text to be analyzed into a tag sequence to be analyzed, wherein the tag sequence to be analyzed is formed by converting each word or symbol in the code text to be analyzed into a tag;
generating the N word vectors to be analyzed from the tag sequence to be analyzed through a word vector tool;
acquiring the set error code text;
converting the error code text into an error tag sequence, wherein the error tag sequence is formed by converting each word or symbol in the error code text into a tag;
and generating the error word vector from the error tag sequence through the word vector tool.
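The conversion from code text to a tag (mark) sequence and then to word vectors might look like the sketch below. The regular expression and the hash-based `toy_word_vector` are stand-ins chosen for illustration; the text does not specify the tokenization rules, and a real word vector tool would use trained embeddings rather than hashes.

```python
import hashlib
import re

# One tag per identifier/keyword, per number, or per single symbol (assumed rule).
TAG_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def to_tag_sequence(code_text):
    """Split code text into a sequence of word and symbol tags."""
    return TAG_PATTERN.findall(code_text)

def toy_word_vector(tag, dim=4):
    """Deterministic placeholder for a word vector tool (not a trained embedding)."""
    digest = hashlib.sha256(tag.encode("utf-8")).digest()
    return [digest[k] / 255.0 for k in range(dim)]

tags_to_analyze = to_tag_sequence("total = count / 0;")
error_tags = to_tag_sequence("/ 0")

vectors_to_analyze = [toy_word_vector(t) for t in tags_to_analyze]
error_vectors = [toy_word_vector(t) for t in error_tags]
```

Both the code text to be analyzed and the error code text go through the same two steps, so identical tags in the two texts end up with identical vectors.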
In one possible design, in an implementation manner of the second aspect of the embodiment of the present application, the combined word vector further includes a matching identifier, where the matching identifier includes a first matching identifier and a second matching identifier, the first matching identifier is used to indicate that the to-be-analyzed tag corresponding to the combined word vector matches the error tag in the error tag sequence, and the second matching identifier is used to indicate that the to-be-analyzed tag corresponding to the combined word vector does not match the error tag in the error tag sequence.
In one possible design, in an implementation manner of the second aspect of the embodiment of the present application, the combined word vector further includes a proportion tag, and the proportion tag identifies the proportion of the corresponding tag to be analyzed in the tag sequence to be analyzed.
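The two extra components of the combined word vector can be sketched as simple per-tag features. This reads the matching identifier as a 1/0 flag and the "duty cycle" tag as the share of positions a tag occupies in the sequence to be analyzed; the second reading in particular is an interpretation of the translated text, and `extra_features` is a hypothetical name.

```python
from collections import Counter

def extra_features(tags_to_analyze, error_tags):
    """Return (match_flag, proportion) for every tag in the sequence to be analyzed."""
    error_set = set(error_tags)
    counts = Counter(tags_to_analyze)
    total = len(tags_to_analyze)
    features = []
    for tag in tags_to_analyze:
        # First matching identifier (1) if the tag occurs in the error sequence,
        # second matching identifier (0) otherwise.
        match_flag = 1 if tag in error_set else 0
        # Share of the sequence occupied by this tag (assumed meaning).
        proportion = counts[tag] / total
        features.append((match_flag, proportion))
    return features

features = extra_features(["x", "=", "x", "/", "0", ";"], ["/", "0"])
```

These values would be appended to each combined word vector alongside the word vector to be analyzed and its attention vector.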
In one possible design, in an implementation manner of the second aspect of the embodiment of the present application, the processing unit is further configured to:
acquiring a first word vector sequence formed by arranging the combined word vectors in forward order;
acquiring a second word vector sequence formed by arranging the combined word vectors in reverse order;
obtaining, through a bidirectional long short-term memory (LSTM) network model, an output vector sequence corresponding to the first word vector sequence and the second word vector sequence, wherein the bidirectional LSTM network model comprises a forward LSTM network model and a reverse LSTM network model, the forward LSTM network model is used for generating a first output sequence corresponding to the first word vector sequence, the reverse LSTM network model is used for generating a second output sequence corresponding to the second word vector sequence, the output vector sequence is formed by splicing the first output sequence and the second output sequence, and the output vector sequence is formed by arranging the output vectors.
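The forward/reverse arrangement and the splicing step can be sketched with a very small, untrained LSTM written in plain Python. The class `TinyLSTM`, its random weights, and the choice to flip the reverse outputs back into original order before splicing are all assumptions made for illustration; a production system would more likely use a deep learning framework.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyLSTM:
    """Single-layer LSTM with randomly initialised (untrained) weights."""

    def __init__(self, input_dim, hidden_dim, seed):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        cols = input_dim + hidden_dim + 1  # input, previous hidden state, bias
        # One weight row per hidden unit for each of the four gates.
        self.weights = {
            gate: [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                   for _ in range(hidden_dim)]
            for gate in ("input", "forget", "output", "cell")
        }

    def _gate(self, name, features, activation):
        return [activation(sum(w * f for w, f in zip(row, features)))
                for row in self.weights[name]]

    def run(self, sequence):
        """Return one hidden-state vector per element of the sequence."""
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        outputs = []
        for x in sequence:
            features = list(x) + h + [1.0]
            i = self._gate("input", features, sigmoid)
            f = self._gate("forget", features, sigmoid)
            o = self._gate("output", features, sigmoid)
            g = self._gate("cell", features, math.tanh)
            c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
            h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
            outputs.append(h)
        return outputs

def bidirectional_outputs(combined_vectors, forward_lstm, reverse_lstm):
    first_sequence = list(combined_vectors)             # forward order
    second_sequence = list(reversed(combined_vectors))  # reverse order
    first_output = forward_lstm.run(first_sequence)
    # Flip the reverse outputs so that index k again refers to position k.
    second_output = reverse_lstm.run(second_sequence)[::-1]
    # Splice the two outputs position by position.
    return [fwd + bwd for fwd, bwd in zip(first_output, second_output)]

combined = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
forward = TinyLSTM(input_dim=3, hidden_dim=2, seed=0)
reverse = TinyLSTM(input_dim=3, hidden_dim=2, seed=1)
output_vectors = bidirectional_outputs(combined, forward, reverse)
```

Each output vector has twice the hidden size: the first half summarizes the tags before the position and the second half summarizes the tags after it.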
A third aspect of an embodiment of the present application provides a server, including: memory, transceiver, processor, and bus system;
wherein the memory is used for storing programs;
the processor is used for executing the program in the memory to perform the following steps:
acquiring N word vectors to be analyzed corresponding to code texts to be analyzed and error word vectors corresponding to error code texts, wherein the error code texts represent code texts matched with the code texts to be analyzed, and N is an integer greater than 1;
N output vectors corresponding to the combined word vectors are obtained through a neural network model, wherein the combined word vectors are generated according to the word vectors to be analyzed and the error word vectors;
calculating, according to the N output vectors and the error word vector, the starting probability and the termination probability of each of the N output vectors in the code text to be analyzed;
determining a target code segment according to the starting probability and the termination probability of each output vector in the code text to be analyzed;
generating a code analysis result of the code text to be analyzed according to the target code segment;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
The processor is further used for executing the program in the memory to perform the following steps:
calculating the starting probability of each of the N output vectors in the code text to be analyzed according to the N output vectors and the error word vector;
calculating the termination probability of each of the N output vectors in the code text to be analyzed according to the N output vectors and the error word vector;
wherein the calculating of the starting probability of each of the N output vectors in the code text to be analyzed according to the N output vectors and the error word vector includes:
determining a starting weight score corresponding to the i-th output vector according to the i-th output vector, the set starting weight and the error word vector, wherein i is an integer greater than or equal to 1 and less than or equal to N;
determining a total starting weight score according to the N output vectors, the set starting weight and the error word vector;
determining the starting probability according to the starting weight score corresponding to the i-th output vector and the total starting weight score;
wherein the calculating of the termination probability of each of the N output vectors in the code text to be analyzed according to the N output vectors and the error word vector includes:
determining a termination weight score corresponding to the j-th output vector according to the j-th output vector, the set termination weight and the error word vector, wherein j is an integer greater than or equal to 1 and less than or equal to N;
determining a total termination weight score according to the N output vectors, the set termination weight and the error word vector;
and determining the termination probability according to the termination weight score corresponding to the j-th output vector and the total termination weight score.
The processor is further used for executing the program in the memory to perform the following steps:
determining the starting position of the target code segment according to the starting probability of each output vector in the code text to be analyzed;
determining the termination position of the target code segment according to the termination probability of each output vector in the code text to be analyzed;
determining the target code segment according to the starting position of the target code segment and the ending position of the target code segment;
wherein the determining the starting position of the target code segment according to the starting probability of each output vector in the code text to be analyzed comprises:
acquiring the output vector with the highest starting probability;
determining the starting position of the target code segment according to the output vector with the highest starting probability and the mapping relation between the output vector and the code text to be analyzed;
wherein the determining the termination position of the target code segment according to the termination probability of each output vector in the code text to be analyzed comprises:
acquiring the output vector with the highest termination probability;
determining the end position of the target code segment according to the output vector with the highest termination probability and the mapping relation between the output vector and the code text to be analyzed;
and determining the target code segment according to the starting position and the ending position.
The processor is further used for executing the program in the memory to perform the following steps:
determining a combined word vector corresponding to the word vector to be analyzed according to the word vector to be analyzed and the error word vector, wherein the combined word vector comprises an attention mechanism vector and the word vector to be analyzed, the attention mechanism vector is obtained by weighting according to attention scores of the error word vector and the word vector to be analyzed, and the attention scores are used for representing the correlation degree of the error word vector and the word vector to be analyzed;
and obtaining, through the neural network model, the N output vectors corresponding to the combined word vectors.
The processor is further used for executing the program in the memory to perform the following steps:
acquiring the code text to be analyzed;
converting the code text to be analyzed into a tag sequence to be analyzed, wherein the tag sequence to be analyzed is formed by converting each word or symbol in the code text to be analyzed into a tag;
generating the N word vectors to be analyzed from the tag sequence to be analyzed through a word vector tool;
acquiring the set error code text;
converting the error code text into an error tag sequence, wherein the error tag sequence is formed by converting each word or symbol in the error code text into a tag;
and generating the error word vector from the error tag sequence through the word vector tool.
The combined word vector further comprises a matching identifier, the matching identifier comprises a first matching identifier and a second matching identifier, the first matching identifier is used for indicating that the tag to be analyzed corresponding to the combined word vector matches an error tag in the error tag sequence, and the second matching identifier is used for indicating that the tag to be analyzed corresponding to the combined word vector does not match any error tag in the error tag sequence.
The combined word vector further comprises a proportion tag, and the proportion tag identifies the proportion of the corresponding tag to be analyzed in the tag sequence to be analyzed.
The processor is further used for executing the program in the memory to perform the following steps:
acquiring a first word vector sequence formed by arranging the combined word vectors in forward order;
acquiring a second word vector sequence formed by arranging the combined word vectors in reverse order;
obtaining, through a bidirectional long short-term memory (LSTM) network model, an output vector sequence corresponding to the first word vector sequence and the second word vector sequence, wherein the bidirectional LSTM network model comprises a forward LSTM network model and a reverse LSTM network model, the forward LSTM network model is used for generating a first output sequence corresponding to the first word vector sequence, the reverse LSTM network model is used for generating a second output sequence corresponding to the second word vector sequence, the output vector sequence is formed by splicing the first output sequence and the second output sequence, and the output vector sequence is formed by arranging the output vectors.
A fourth aspect of the embodiments of the application provides a computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of the first aspect.
A fifth aspect of the embodiments of the application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of the first aspect.
From the above technical solutions, the embodiment of the present application has the following advantages:
After the word vectors to be analyzed of the code text to be analyzed are obtained, the starting probability and the termination probability of each output vector in the code text to be analyzed are calculated by comparing the word vectors to be analyzed with the error word vector. The starting position and the termination position of a target code segment in the code text to be analyzed are then determined according to the starting probabilities and the termination probabilities, the target code segment being the error code segment identified by the analysis. Because the target code segment is determined by calculating vector probabilities, complex error types can be analyzed, which solves the technical problem that complex errors in code currently cannot be checked.
Drawings
FIG. 1 is a block diagram of a developer platform according to an embodiment of the present application;
FIG. 2 is a representation of highlighting error codes in an embodiment of the present application;
FIG. 3 is a program display interface on a terminal device of a software developer;
FIG. 4 is an interface diagram for a manager to log into a developer platform to view;
FIG. 5 is a schematic diagram of an embodiment of a code analysis method according to an embodiment of the present application;
FIG. 6 is a diagram illustrating the generation of a combined word vector in accordance with an embodiment of the present application;
FIG. 7 is a diagram of error code block vectors according to an embodiment of the present application;
FIG. 8 is another diagram of error code block vectors in accordance with an embodiment of the present application;
FIG. 9 is a flowchart of an embodiment of the present application applied to a terminal device;
FIG. 10 is a schematic diagram of calculating a starting probability and a termination probability by using error code block vectors and respective output vectors according to an embodiment of the present application;
FIG. 11 is a schematic diagram of calculating an attention score in an embodiment of the application;
FIG. 12 is a schematic diagram of calculating an attention score in an embodiment of the application;
FIG. 13 is a schematic diagram of converting a code text to be analyzed into a tag sequence according to an embodiment of the present application;
FIG. 14 is a schematic diagram of a server or terminal device inputting a first word vector sequence into a forward LSTM network model;
FIG. 15 is a schematic diagram of a server or terminal device inputting a second word vector sequence into a reverse LSTM network model;
FIG. 16 is a schematic diagram of an application of a code analysis method according to an embodiment of the present application;
FIG. 17 is a schematic diagram of an application example of a code analysis method according to an embodiment of the present application;
FIG. 18 is a diagram of a display interface during server training;
FIG. 19 is a schematic diagram of an apparatus for code analysis according to an embodiment of the present application;
FIG. 20 is a schematic diagram of a server structure according to an embodiment of the present application.
Detailed Description
The embodiment of the application provides a code analysis method and a related device, which are used for solving the technical problem that complex errors in codes cannot be checked at present.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented, for example, in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "includes" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
It should be appreciated that after a software developer writes program code, the program code may be run and tested. If an error (bug) occurs in the code, the application cannot run correctly and the product cannot go online; the software developer then needs to analyze and inspect the code in depth. Conventional code inspection relies on manual work, which has the drawbacks of high labor cost and long inspection time, and is not guaranteed to find the erroneous code.
In view of this, an embodiment of the present application provides a developer platform for inspecting and analyzing code. Fig. 1 is a schematic diagram of a developer platform according to an embodiment of the present application. After a software developer writes code, the code can be sent to the server of the developer platform, which analyzes and checks the code; the code can be released online once it has been confirmed. Specifically, after the software developer writes the program code on a terminal device, the program code can be uploaded to the server. The server detects and analyzes the program code and returns the analysis result to the terminal device for display. For example, when the server identifies a code segment as a suspected error code segment, it sends the position of that segment in the program code to the terminal device, so that the terminal device displays the code segment or highlights the suspected error code segment in the program code.
Fig. 2 is a display diagram of highlighting an error code in an embodiment of the present application. In the display interface in which the terminal device displays the program code, the boxed portion is the highlighted or marked code portion. After observing the highlighted portion, the software developer can modify that portion of the code without paying attention to the non-highlighted portions, which helps the software developer work more efficiently.
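On the terminal side, highlighting the returned position could be as simple as the sketch below. It assumes that the server returns character offsets for the suspected error code segment and that plain-text markers stand in for visual highlighting; the function name and the marker strings are hypothetical.

```python
def highlight_segment(code_text, start_pos, end_pos,
                      open_mark="[[", close_mark="]]"):
    """Wrap the suspected error code segment in visible markers."""
    if not 0 <= start_pos <= end_pos <= len(code_text):
        raise ValueError("segment positions fall outside the code text")
    return (code_text[:start_pos] + open_mark
            + code_text[start_pos:end_pos] + close_mark
            + code_text[end_pos:])

code_text = "ratio = total / 0;"
suspect = "total / 0"
start_pos = code_text.index(suspect)  # stand-in for the position sent by the server
end_pos = start_pos + len(suspect)
highlighted = highlight_segment(code_text, start_pos, end_pos)
print(highlighted)
```

A graphical editor would apply a background color or a box to the same character range instead of inserting markers.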
In the embodiment of the present application, the program code sent by the software developer to the developer platform may be the code of a complete program, such as a complete mobile phone Application (APP), a complete computer program software or a complete service framework; the program code may also be applet code, functional module code embedded within an application program or an operating system kernel or the like; the program code may be a front end code of a web page, a back end code of a web page, or a code segment to be analyzed, etc., and in practical application, may be other codes, which is not limited herein.
In the embodiment of the application, the program code sent by the software developer to the developer platform may be sent in the form of a data packet or in the form of a file; in practical applications, the program code may also be sent in encrypted form, which is not limited herein.
In the embodiment of the application, a software developer can send the program codes to the developer platform after the program codes are written, can also send the program codes to the developer platform at intervals of preset time in the process of writing the program codes, obtain real-time feedback of the developer platform and highlight wrong program codes in real time, and can also set a 'checking' virtual button on a software interface of the software developer for writing the codes in practical application, and when the software developer clicks the 'checking' virtual button, the terminal equipment sends the current program codes to a server for checking.
Fig. 3 shows a program display interface on a terminal device of a software developer, where the interface for writing the program has a title bar, a functional board and a main interface, where the software developer can write the program through the main interface, when the software developer wants to check an error in a program code, the software developer can click on a "check" virtual button in the functional board, and trigger a check instruction, and then the terminal device can send the program code to a server for checking according to the check instruction, and highlight an error code segment in the program code in the main interface according to a check result returned by the server.
It will be appreciated that terminal devices include, but are not limited to, cell phones, desktop computers, tablet computers, notebook computers, and palm top computers.
In the embodiment of the present application, code analysis may be code retrieval, code classification, code marking, code error correction, etc. The foregoing description takes code error correction as an example; in practical application, other code analysis methods and other methods for displaying code analysis results may also be used. For example, after the developer platform receives program code sent by a software developer, it performs code analysis to obtain the categories of the different code segments and then sends the code segments to the terminal device marked with different colors, so that the terminal device displays different code segments in different colors, for example, the main program in red and embedded functions in blue, which is not limited here.
The manager of the developer platform can log in to the developer platform to view the program code uploaded by the terminal devices and also to view the error code segments of the program code. Fig. 4 is an interface diagram seen by a manager who has logged in to the developer platform. The interface displayed on the developer platform may have a title bar, a function block, and a main interface, and the main interface may display a terminal device identifier, a code language type, and the program code. It can be understood that the developer platform on the server can accept program code uploaded by a plurality of terminal devices, analyze the program code from each of these terminal devices, and send the corresponding analysis results back to the respective terminal devices.
In the embodiment of the present application, the developer platform may perform code analysis on multiple programming languages. For example, the programming language used on terminal device 1 is PHP, the programming language used on terminal device 2 is C, and the programming language used on terminal device 3 is Java. In practical application, the developer platform may also process other programming languages such as C++, which is not limited here.
It can be understood that after the developer platform receives the program code sent by the terminal device, the program code is analyzed. If traditional manual code inspection is adopted, the scale of the platform and of the applications is greatly limited; if only inspection rules based on code specifications are set, only simple program errors can be found, and complex error codes cannot be analyzed. The embodiment of the application provides a code analysis method and a related device, which are used to solve the technical problem that more complex errors in code currently cannot be checked.
Fig. 5 is a schematic diagram of an embodiment of a code analysis method according to an embodiment of the present application, and it can be seen that the code analysis method according to the embodiment of the present application includes:
501. Acquiring N word vectors to be analyzed corresponding to code texts to be analyzed and error word vectors corresponding to error code texts, wherein the error code texts represent code texts matched with the code texts to be analyzed, and N is an integer greater than 1;
in the embodiment of the application, after the server acquires the code text to be analyzed, the server converts it into word vectors to be analyzed. The code text to be analyzed can be a code text written by a software developer on the terminal device. If code analysis is needed, the software developer sends the written code text to the server through the terminal device; after receiving the code text, the server determines, according to the instruction of the terminal device, that the code text is to be analyzed and takes it as the code text to be analyzed. The server converts the code text to be analyzed into N word vectors to be analyzed. The conversion may be performed by a word vector tool, or by first converting the code text to be analyzed into a tag sequence; in practical application there are many ways of converting text into word vectors, which are not limited in detail here. The server may establish a mapping relationship between the code text to be analyzed and the N word vectors to be analyzed according to the conversion process: for example, the 1st word in the code text to be analyzed corresponds to the 1st word vector to be analyzed, the 2nd word corresponds to the 2nd word vector to be analyzed, …, and the N-th word corresponds to the N-th word vector to be analyzed. Thus, the word vectors to be analyzed may constitute a sequence of word vectors to be analyzed:
Word vector sequence to be analyzed = [1st word vector to be analyzed, 2nd word vector to be analyzed, …, N-th word vector to be analyzed];
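The conversion and the position mapping described above can be sketched as follows. This is only an illustrative sketch: the regular-expression tokenizer, the function names `tokenize` and `build_word_vectors`, and the random embedding table are assumptions standing in for whatever word vector tool is actually used, which the description leaves open.

```python
import random
import re

def tokenize(code_text):
    # Assumed tokenizer: identifiers/keywords, numbers, and single symbols.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code_text)

def build_word_vectors(tokens, dim=4, seed=0):
    # Hypothetical embedding table: each distinct token gets a fixed random vector.
    # A trained word vector tool would replace this table in practice.
    rng = random.Random(seed)
    table = {}
    vectors = []
    for token in tokens:
        if token not in table:
            table[token] = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        vectors.append(table[token])
    return vectors

code_to_analyze = "switch (x) { case 1: y = 2; }"
tokens = tokenize(code_to_analyze)
word_vectors = build_word_vectors(tokens)

# Mapping relationship: the i-th word vector corresponds to the i-th word.
mapping = {i: token for i, token in enumerate(tokens)}
```

Because the vectors are kept in token order, an index found later in the pipeline can be mapped straight back to a position in the code text.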
similarly, after the server acquires the error code text, the server converts the error code text into error word vectors. The error code text is preset by an administrator of the developer platform and is used for checking whether the text to be analyzed contains error code segments similar to the error code text. The administrator can store a plurality of error code texts in the server as required, so that the server can select a suitable error code text for code analysis according to the error that needs to be detected. After receiving the code text to be analyzed and the instruction sent by the terminal device, the server can select an appropriate error code text according to the instruction of the terminal device and then convert it into error word vectors; the conversion is similar to the conversion of the code text to be analyzed into word vectors to be analyzed and is not described in detail here. It will be appreciated that an error code text generally corresponds to more than one error word vector. For example, the 1st word in the error code text corresponds to the 1st error word vector, the 2nd word corresponds to the 2nd error word vector, …, and the M-th word corresponds to the M-th error word vector. Thus, the error word vectors may constitute a sequence of error word vectors:
Error word vector sequence = [1st error word vector, 2nd error word vector, …, M-th error word vector];
502. N output vectors corresponding to the combined word vectors are obtained through the neural network model, wherein the combined word vectors are generated according to the word vectors to be analyzed and the error word vectors;
in the embodiment of the application, the server can generate the combined word vector according to the word vector to be analyzed and the error word vector, then input the combined word vector into the neural network model, and acquire N output vectors corresponding to the combined word vector through the neural network model.
In the embodiment of the application, the combined word vector can be spliced by a plurality of vectors and marks, at least comprises an attention word vector, and can also comprise a word vector to be analyzed, a matching mark and a duty ratio analysis mark.
Fig. 6 is a schematic diagram of generating a combined word vector according to an embodiment of the present application, where it can be seen that the i-th combined word vector may be composed of an i-th attention word vector 601, an i-th word vector to be analyzed 602, an i-th matching identifier 603, and an i-th duty analysis tag 604, where the attention vector is generated by an attention mechanism. The server calculates an attention word vector 601 through an attention mechanism according to the 1 st error word vector 605, the 2 nd error word vector 606 … Mth error word vector and the i th word vector 602 to be analyzed.
The attention word vector and the attention mechanism are described in detail below.
The server can generate corresponding weights according to how closely the word vector to be analyzed is related to the error word vectors, and then perform weighting according to these weights to obtain the attention word vector. The degree of correlation between a word vector to be analyzed and an error word vector can be expressed by the score $a_{i,j}$. Table 1 is a score table of the word vectors to be analyzed and the error word vectors; the server can calculate N×M scores from the N word vectors to be analyzed and the M error word vectors. To obtain the i-th attention word vector for the i-th word vector to be analyzed, the server can calculate by the following formula:
$$\mathrm{vector}_{attention} = \sum_{j=1}^{M} a_{i,j} \cdot \mathrm{negvector}_j$$
wherein $\mathrm{vector}_{attention}$ is the i-th attention word vector, $a_{i,j}$ is a score representing the degree of correlation between the i-th word vector to be analyzed and the j-th error word vector, and $\mathrm{negvector}_j$ is the j-th error word vector. In connection with Table 1, the i-th attention word vector actually calculated by the server is:
the i-th attention word vector = [$a_{i,1}$ * 1st error word vector, $a_{i,2}$ * 2nd error word vector, …, $a_{i,M}$ * M-th error word vector];
As can be seen from the above equation, the server actually calculates the ith attention word vector 601 through the attention mechanism by using the 1 st error word vector 605, the 2 nd error word vector 606 … mth error word vector and the ith word vector 602 to be analyzed.
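A minimal sketch of this weighting step is given below. It assumes that the weighted error word vectors are summed element-wise, as is usual for attention mechanisms; the function name `attention_word_vector` and the numeric scores are made up for illustration.

```python
def attention_word_vector(scores_i, error_word_vectors):
    # scores_i[j] plays the role of a_{i,j}; error_word_vectors[j] is the j-th error word vector.
    # Assumption: the weighted error word vectors are summed element-wise.
    dim = len(error_word_vectors[0])
    result = [0.0] * dim
    for a_ij, neg in zip(scores_i, error_word_vectors):
        for k in range(dim):
            result[k] += a_ij * neg[k]
    return result

error_word_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # M = 3, made-up values
scores_i = [0.2, 0.3, 0.5]                                 # made-up scores for one word i
attn = attention_word_vector(scores_i, error_word_vectors)
```

The resulting vector has the same dimension as a single error word vector regardless of M, which keeps the combined word vector at a fixed size.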
TABLE 1
In the embodiment of the present application, the combined word vector generally further includes a word vector to be analyzed, that is, the combined word vector is formed by splicing an attention word vector and the word vector to be analyzed, that is:
the i-th combined word vector= [ i-th attention word vector, i-th word vector to be analyzed ];
the server may form a mapping relationship among the combined word vector, the attention word vector, and the word vector to be analyzed according to the above procedure; that is, the i-th combined word vector corresponds to the i-th attention word vector and to the i-th word vector to be analyzed.
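The splicing of the two vectors can be illustrated with plain list concatenation. The function name `combine_word_vectors` and the sample values are hypothetical; the optional matching identifier and analysis tag mentioned earlier are left out of this sketch.

```python
def combine_word_vectors(attention_vectors, word_vectors):
    # The i-th combined word vector is the i-th attention word vector
    # followed by the i-th word vector to be analyzed.
    if len(attention_vectors) != len(word_vectors):
        raise ValueError("expected one attention word vector per word vector to be analyzed")
    return [list(attn) + list(word) for attn, word in zip(attention_vectors, word_vectors)]

attention_vectors = [[0.7, 0.8], [0.1, 0.2]]        # made-up values, N = 2
word_vectors = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # made-up values
combined = combine_word_vectors(attention_vectors, word_vectors)
```

Since the order of the inputs is preserved, index i of `combined` still refers to the i-th word of the code text to be analyzed.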
In the embodiment of the application, after the server calculates N combined word vectors, the combined word vectors may be input into the neural network model to obtain output vectors corresponding to the combined word vectors, and it may be understood that the server inputs the N combined word vectors into the neural network model to obtain N output vectors corresponding to the N combined word vectors, that is, the 1 st combined word vector corresponds to the 1 st output vector, the 2 nd combined word vector corresponds to the 2 nd output vector … nth combined word vector corresponds to the nth output vector.
The neural network model used by the server may be a recurrent neural network (RNN) model, a long short-term memory (LSTM) model, a bidirectional LSTM model, a gated recurrent unit (GRU) model, etc., which is not limited here.
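To make the one-output-per-input relationship concrete, the following sketch runs a very small recurrent network over a sequence of combined word vectors. It is a plain tanh RNN with randomly initialized, untrained weights, chosen only for brevity; the names `simple_rnn`, `w_in`, and `w_h` are assumptions, and a trained LSTM, bidirectional LSTM, or GRU would take its place in a real system.

```python
import math
import random

def simple_rnn(inputs, hidden_dim, seed=0):
    # Untrained weights, for illustration only.
    rng = random.Random(seed)
    in_dim = len(inputs[0])
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)] for _ in range(hidden_dim)]
    h = [0.0] * hidden_dim
    outputs = []
    for x in inputs:
        # New hidden state from the current input and the previous hidden state.
        h = [
            math.tanh(
                sum(w_in[r][c] * x[c] for c in range(in_dim))
                + sum(w_h[r][c] * h[c] for c in range(hidden_dim))
            )
            for r in range(hidden_dim)
        ]
        outputs.append(h)
    return outputs

combined_vectors = [[0.7, 0.8, 1.0], [0.1, 0.2, 4.0], [0.3, 0.3, 0.5]]  # made-up, N = 3
output_vectors = simple_rnn(combined_vectors, hidden_dim=2)
```

The i-th entry of `output_vectors` depends on inputs 1 through i, so each output vector carries context from the preceding code.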
503. Calculating the initial probability of each output vector in the N output vectors in the code text to be analyzed and the termination probability of each output vector in the code text to be analyzed according to the N output vectors and the error word vector;
in the embodiment of the application, the server can calculate, according to the N output vectors and the M error word vectors, the initial (start) probability of each of the N output vectors in the code text to be analyzed. The initial probability represents the probability that the position in the code text to be analyzed corresponding to the output vector is the start position of the target code segment. The target code segment is the code segment in the code text to be analyzed that is closest to the error code text; to determine the target code segment, the server needs to find its start position and end position in the text to be analyzed. The server calculates the initial probabilities of the N output vectors and selects the output vector with the highest initial probability; then, using the mapping relation between the words or symbols in the text to be analyzed and the N output vectors, the server obtains the start position of the target code segment in the text to be analyzed. Similarly, the server can obtain the end position of the target code segment in the text to be analyzed through the output vector with the highest termination probability and the mapping relation.
The server may first obtain one output vector, for example, the i-th output vector, then determine a start probability for the i-th output vector according to the similarity between the i-th output vector and the start portion of an error code block vector, and determine a termination probability for the i-th output vector according to the similarity between the i-th output vector and the end portion of the error code block vector, where the error code block vector is determined by the M error word vectors. The server may splice the M error word vectors into an error code block vector, or splice the error word vectors after weighting them, or may input the error word vectors into a bidirectional LSTM neural network and then weight and splice the vectors output by the neural network to obtain the error code block vector, which is not limited here. The start portion of the error code block vector may be obtained from the error code block vector and the start weight, and the end portion of the error code block vector may be obtained from the error code block vector and the termination weight.
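The first two options for building the error code block vector (plain splicing, and splicing after weighting) can be sketched as below. The function name `error_code_block_vector` and the weights are hypothetical; the bidirectional LSTM variant is not shown.

```python
def error_code_block_vector(error_word_vectors, weights=None):
    # With no weights, the M error word vectors are simply spliced together.
    # With weights, each error word vector is scaled before splicing.
    if weights is None:
        weights = [1.0] * len(error_word_vectors)
    if len(weights) != len(error_word_vectors):
        raise ValueError("expected one weight per error word vector")
    block = []
    for w, vec in zip(weights, error_word_vectors):
        block.extend(w * x for x in vec)
    return block

error_word_vectors = [[1.0, 2.0], [3.0, 4.0]]  # made-up values, M = 2
plain_block = error_code_block_vector(error_word_vectors)
weighted_block = error_code_block_vector(error_word_vectors, weights=[0.5, 2.0])
```

With splicing, the block vector has length M times the word vector dimension, so the start and termination weights must be sized to match.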
FIG. 7 is a schematic diagram of error code block vectors according to an embodiment of the present application, in which error code text is converted into error code word vectors and then into error code block vectors. Generally, an error code text corresponds to an error code block vector, as shown on the left side of FIG. 7 for error code text and on the right for error code blocks.
FIG. 8 is another schematic diagram of error code block vectors in an embodiment of the present application, and it can be seen that the method of the embodiment of the present application can convert program codes in any language into vectors, and the left side of FIG. 8 is error code text, and the right side is error code block. Fig. 7 is a diagram showing conversion of the php language code into a vector, and fig. 8 is a diagram showing conversion of the C language code into a vector.
504. Determining an object code segment according to the initial probability of each output vector in the code text to be analyzed and the termination probability of each output vector in the code text to be analyzed;
in the embodiment of the application, the server can obtain the start position of the target code segment in the text to be analyzed through the output vector with the highest initial probability and the mapping relation, then obtain the end position of the target code segment in the text to be analyzed through the output vector with the highest termination probability and the mapping relation, and finally determine the target code segment in the text to be analyzed according to the start position and the end position of the target code segment.
As shown in FIG. 2, the boxed portion is the target code segment determined by the server. The start position of the target code segment is "switch" and the end position is ";"; the target code segment is the code segment shown in FIG. 2, which the server may highlight. It can be understood that, according to the initial probability of each output vector in the code text to be analyzed, the server determines that the output vector with the highest initial probability is the i-th output vector, which corresponds to "switch"; "switch" can then be found according to the mapping relationship between the output vectors and the text to be analyzed, which determines the start position of the target code segment. Similarly, the server can determine the end position of the target code segment, and thereby determine the target code segment.
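The selection of the target code segment can be sketched as two argmax operations followed by a lookup through the position mapping. The token list and the probabilities are made up, and `select_target_segment` is a hypothetical name.

```python
def select_target_segment(tokens, start_probs, end_probs):
    # Index of the output vector with the highest start / termination probability.
    start = max(range(len(start_probs)), key=start_probs.__getitem__)
    end = max(range(len(end_probs)), key=end_probs.__getitem__)
    # The i-th output vector maps to the i-th token, so the indices are positions.
    return start, end, tokens[start:end + 1]

tokens = ["int", "x", ";", "switch", "(", "x", ")", "{", "}", ";", "return"]
start_probs = [0.02, 0.03, 0.05, 0.60, 0.05, 0.05, 0.05, 0.05, 0.04, 0.03, 0.03]
end_probs = [0.02, 0.02, 0.10, 0.03, 0.03, 0.03, 0.03, 0.04, 0.10, 0.55, 0.05]
start, end, segment = select_target_segment(tokens, start_probs, end_probs)
```

The description does not say what happens when the best end position precedes the best start position; in this sketch the slice is simply empty in that case.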
505. And generating a code analysis result of the code text to be analyzed according to the target code segment.
In the embodiment of the application, the server may generate a code analysis result of the code text to be analyzed according to the target code segment, where the code analysis result may include a start position identifier and an end position identifier of the target code block to locate the target code segment, and may also be the text of the target code segment, and the specific application is not limited herein.
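One possible shape for such a result is sketched below. The class name `CodeAnalysisResult`, its field names, and the helper `make_result` are assumptions; the description only requires that the result can locate the target code segment or carry its text.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CodeAnalysisResult:
    start_position: int                  # start position identifier of the target code segment
    end_position: int                    # end position identifier of the target code segment
    segment_text: Optional[str] = None   # optional text of the target code segment

def make_result(tokens: List[str], start: int, end: int, include_text: bool = True) -> CodeAnalysisResult:
    text = " ".join(tokens[start:end + 1]) if include_text else None
    return CodeAnalysisResult(start, end, text)

tokens = ["int", "x", ";", "switch", "(", "x", ")", ";"]
result = make_result(tokens, 3, 7)
positions_only = make_result(tokens, 3, 7, include_text=False)
```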
After the server generates the code analysis result, the code analysis result may be sent to the terminal device of the software developer, so that the terminal device highlights the target code segment in the code analysis result, as shown in fig. 3, the box selection part in the main interface of the code written by the software developer is highlighted, so as to remind the software developer that the highlight part is the target code segment detected by the server.
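On the terminal side, highlighting can be as simple as wrapping the target code segment in markers before rendering. The sketch below uses textual markers in place of a real editor highlight; `highlight` and the marker strings are hypothetical.

```python
def highlight(tokens, start, end, open_mark="[[", close_mark="]]"):
    # Surround tokens[start..end] (inclusive) with markers and join for display.
    marked = (
        tokens[:start]
        + [open_mark]
        + tokens[start:end + 1]
        + [close_mark]
        + tokens[end + 1:]
    )
    return " ".join(marked)

tokens = ["int", "x", ";", "switch", "(", "x", ")", ";"]
display_text = highlight(tokens, 3, 7)
```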
The method of the embodiment of the application can also be applied to the terminal equipment, and fig. 9 is a flowchart of an embodiment of the application applied to the terminal equipment. The embodiment of the application provides a code analysis method, which is applied to terminal equipment and comprises the following steps:
901. the method comprises the steps that terminal equipment obtains N word vectors to be analyzed corresponding to code texts to be analyzed and error word vectors corresponding to error code texts, the error code texts represent code texts matched with the code texts to be analyzed, and N is an integer greater than 1;
In the embodiment of the application, the terminal device is the device that the software developer uses to write software; the terminal device can receive, through the client used by the software developer, the code text entered by the software developer and use it as the code text to be analyzed. The error code text may be stored in advance in a database of the terminal device or may be obtained from a server.
Other contents of step 901 in the embodiment of the present application are similar to those of step 501 in the respective embodiments corresponding to fig. 5, and are not repeated here.
902. The terminal equipment acquires N output vectors corresponding to the combined word vectors through a neural network model, wherein the combined word vectors are generated according to the word vectors to be analyzed and the error word vectors;
step 902 in the embodiment of the present application is similar to step 502 in the corresponding embodiments of fig. 5, and will not be described herein.
903. The terminal equipment calculates the initial probability of each output vector in N output vectors in the code text to be analyzed and the termination probability of each output vector in the code text to be analyzed according to the N output vectors and the error word vector;
step 903 in the embodiment of the present application is similar to step 503 in the corresponding embodiments of fig. 5, and will not be described herein.
904. The terminal equipment determines an object code segment according to the initial probability of each output vector in the code text to be analyzed and the termination probability of each output vector in the code text to be analyzed;
step 904 in the embodiment of the present application is similar to step 504 in the corresponding embodiments of fig. 5, and will not be described here again.
905. And the terminal equipment generates a code analysis result of the code text to be analyzed according to the target code segment.
Step 905 in the embodiment of the present application is similar to step 505 in the corresponding embodiments of fig. 5, and will not be described herein.
After the terminal equipment generates the code analysis result of the code text to be analyzed, the corresponding target code segment can be directly highlighted in the client used by the software developer for writing the software.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 5 or fig. 9, an embodiment of the present application further provides an optional embodiment of a method for generating a code vector, further including:
calculating the initial probability of each output vector in N output vectors in the code text to be analyzed according to the N output vectors and the error word vector;
calculating the termination probability of each output vector in the N output vectors in the code text to be analyzed according to the N output vectors and the error word vector;
According to the N output vectors and the error word vector, calculating the initial probability of each output vector in the N output vectors in the code text to be analyzed comprises the following steps:
determining a starting weight score corresponding to the ith output vector according to the ith output vector, the set starting weight and the error word vector, wherein i is an integer which is greater than or equal to 1 and less than or equal to N;
determining a total score of the initial weights according to the N output vectors, the set initial weights and the error word vectors;
determining a starting probability according to a starting weight score and a starting weight total score corresponding to the ith output vector;
according to the N output vectors and the error word vector, calculating the termination probability of each output vector in the N output vectors in the code text to be analyzed, wherein the method comprises the following steps:
determining termination weight scores corresponding to the jth output vector according to the jth output vector, the set termination weight and the error word vector;
determining total scores of the termination weights according to the N output vectors, the set termination weights and the error word vectors;
and determining termination probability according to the termination weight fraction and the termination weight total fraction corresponding to the j-th output vector, wherein j is an integer which is greater than or equal to 1 and less than or equal to N.
In the embodiment of the application, the server or the terminal equipment can calculate the initial probability through the output vector, the set initial weight and the error word vector, and the calculation formula is as follows:
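$$P^{(start)}(i) = \frac{\exp\left(h_i W^{(start)} \mathrm{negvector}\right)}{\sum_{k=1}^{N} \exp\left(h_k W^{(start)} \mathrm{negvector}\right)}$$
wherein $P^{(start)}(i)$ is the start probability of the i-th output vector, $h_i$ is the i-th output vector, $W^{(start)}$ is the starting weight, $\mathrm{negvector}$ is the error code block vector, $\exp\left(h_i W^{(start)} \mathrm{negvector}\right)$ is the starting weight score corresponding to the i-th output vector, and $\sum_{k=1}^{N} \exp\left(h_k W^{(start)} \mathrm{negvector}\right)$ is the total score of the starting weights.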
The initial weight is a weight preset by a server or terminal equipment through training. The server or the terminal device may directly use the trained initial weight to calculate, and the specific training process may refer to the subsequent embodiment.
In the embodiment of the application, the server or the terminal equipment can calculate the termination probability through the output vector, the set termination weight and the error word vector, and the calculation formula is as follows:
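$$P^{(end)}(i) = \frac{\exp\left(h_i W^{(end)} \mathrm{negvector}\right)}{\sum_{k=1}^{N} \exp\left(h_k W^{(end)} \mathrm{negvector}\right)}$$
wherein $P^{(end)}(i)$ is the termination probability of the i-th output vector, $h_i$ is the i-th output vector, $W^{(end)}$ is the termination weight, $\mathrm{negvector}$ is the error code block vector, $\exp\left(h_i W^{(end)} \mathrm{negvector}\right)$ is the termination weight score corresponding to the i-th output vector, and $\sum_{k=1}^{N} \exp\left(h_k W^{(end)} \mathrm{negvector}\right)$ is the total score of the termination weights.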
The termination weight is a weight preset by the server or the terminal equipment through training. The server or the terminal device may directly use the trained termination weight to calculate, and the specific training process may refer to the subsequent embodiment.
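The score-over-total-score structure of both probabilities can be sketched as follows. This sketch assumes that each weight score is the exponential of a bilinear product between the output vector, the weight matrix, and the error code block vector, so that the probabilities form a softmax over the N positions; the function names and all numeric values are made up, and the weights here are not trained.

```python
import math

def bilinear_scores(outputs, weight, block_vector):
    # raw_i = h_i . (W . negvector); the weight score is exp(raw_i).
    projected = [sum(w * v for w, v in zip(row, block_vector)) for row in weight]
    raw = [sum(h_k * p_k for h_k, p_k in zip(h, projected)) for h in outputs]
    # Subtracting the maximum rescales every score by the same factor,
    # which leaves the final ratios unchanged but avoids overflow in exp().
    shift = max(raw)
    return [math.exp(r - shift) for r in raw]

def position_probabilities(outputs, weight, block_vector):
    scores = bilinear_scores(outputs, weight, block_vector)
    total = sum(scores)  # total score over all N output vectors
    return [s / total for s in scores]

outputs = [[0.9, 0.1], [0.2, 0.3], [0.1, 0.8]]   # N = 3 made-up output vectors
negvector = [1.0, 0.0, 1.0]                       # made-up error code block vector
w_start = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]      # made-up starting weight
w_end = [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]]        # made-up termination weight

start_probs = position_probabilities(outputs, w_start, negvector)
end_probs = position_probabilities(outputs, w_end, negvector)
```

The same function serves both cases; only the weight matrix differs between the start and termination calculations.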
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 5 or fig. 9, an embodiment of the present application further provides an optional embodiment of a method for generating a code vector, where determining, according to a start probability of each output vector in a code text to be analyzed and a stop probability of each output vector in the code text to be analyzed, the target code segment includes:
determining the starting position of the target code segment according to the starting probability of each output vector in the code text to be analyzed;
determining the termination position of the target code segment according to the termination probability of each output vector in the code text to be analyzed;
determining the target code segment according to the starting position of the target code segment and the ending position of the target code segment;
wherein determining the starting position of the target code segment according to the starting probability of each output vector in the code text to be analyzed comprises:
obtaining an output vector with highest initial probability;
determining the initial position of the target code segment according to the output vector with the highest initial probability and the mapping relation between the output vector and the code text to be analyzed;
wherein determining the termination position of the target code segment according to the termination probability of each output vector in the code text to be analyzed comprises:
Obtaining an output vector with highest termination probability;
determining the end position of the target code segment according to the output vector with the highest termination probability and the mapping relation between the output vector and the code text to be analyzed;
the target code segment is determined based on the start position and the end position.
In the embodiment of the present application, the server or the terminal device may calculate the start probability of each output vector, for example, the start probability of the 1st output vector is 0.05, the start probability of the 2nd output vector is 0.1, and so on, and then select the output vector with the highest start probability. For example, if the i-th output vector has the highest start probability, the server or the terminal device determines the start position of the target code segment according to the mapping relationship of the i-th output vector.
In the embodiment of the present application, the server or the terminal device may calculate the termination probability of each output vector, for example, the termination probability of the 1st output vector is 0.08, the termination probability of the 2nd output vector is 0.01, and so on, and then select the output vector with the highest termination probability. For example, if the i-th output vector has the highest termination probability, the server or the terminal device determines the termination position of the target code segment according to the mapping relationship of the i-th output vector.
Table 2 is a data table obtained by calculating the start probability and the stop probability according to the embodiment of the present application, and it can be seen that each output vector can be calculated to obtain the corresponding start probability and stop probability.
TABLE 2
Output vector identifier     1       2       …       N
Start probability            0.05    0.10    …       0.06
Termination probability      0.08    0.01    …       0.05
Fig. 10 is a schematic diagram of calculating a start probability and a stop probability by using an error code block vector and each output vector according to an embodiment of the present application, it can be seen that a server or a terminal device first converts the error word vector into the error code block vector, and then calculates the start probability and the stop probability corresponding to each output vector by using an attention mechanism, that is, a start probability calculation formula and a stop probability calculation formula.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 5 or fig. 9, an embodiment of the present application further provides an optional embodiment of a method for generating a code vector, where N output vectors corresponding to a combined word vector are obtained through a neural network model, where the combined word vector is generated according to a word vector to be analyzed and an error word vector, and includes:
determining a combined word vector corresponding to the word vector to be analyzed according to the word vector to be analyzed and the error word vector, wherein the combined word vector comprises an attention mechanism vector and the word vector to be analyzed, the attention mechanism vector is obtained by weighting according to attention scores of the error word vector and the word vector to be analyzed, and the attention scores are used for representing the correlation degree of the error word vector and the word vector to be analyzed;
N output vectors corresponding to the combined word vectors are obtained through the neural network model.
In the embodiment of the present application, the server or the terminal device first generates the combined word vector according to the word vector to be analyzed and the error word vector; the generating manner may refer to the embodiments corresponding to fig. 5 and is not described here again. The server or the terminal device can calculate the attention score $a_{i,j}$ according to the error word vector and the word vector to be analyzed, as shown in Table 1. There are a number of ways to calculate the attention score $a_{i,j}$, and embodiments of the present application provide one calculation formula for it:
$$a_{i,j} = \frac{\exp\left(\mathrm{MLP}(\mathrm{original\_vector}_i) \cdot \mathrm{MLP}(\mathrm{negvector}_j)\right)}{\sum_{j'=1}^{M} \exp\left(\mathrm{MLP}(\mathrm{original\_vector}_i) \cdot \mathrm{MLP}(\mathrm{negvector}_{j'})\right)}$$
wherein $a_{i,j}$ is the attention score between the i-th word vector to be analyzed and the j-th error word vector, $\mathrm{original\_vector}_i$ is the i-th word vector to be analyzed, $\mathrm{negvector}_j$ is the j-th error word vector, and MLP is a multi-layer perceptron (MLP) function calculated as follows:
$$\mathrm{MLP}(x) = \max(0, Wx + b), \quad W \in \mathbb{R}^{d \times d}, \quad b \in \mathbb{R}^{d}$$
In the MLP function, W is a weight matrix and b is a bias vector; both are learned by the network and may be obtained through training, and the specific training method is not described here. In the embodiment of the application, a single layer of the multi-layer neural network is generally adopted, with the maximum-value function max(0, ·) as its activation.
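The attention score calculation can be sketched as below. The single-layer MLP follows the max(0, Wx + b) form given above; normalizing the dot products of the MLP outputs with a softmax over the M error word vectors is an assumption of this sketch, and the identity weight, zero bias, and vectors are made-up untrained values.

```python
import math

def mlp(x, weight, bias):
    # MLP(x) = max(0, W x + b), applied element-wise (one layer).
    return [
        max(0.0, sum(w * v for w, v in zip(row, x)) + b_r)
        for row, b_r in zip(weight, bias)
    ]

def attention_scores(word_vector, error_word_vectors, weight, bias):
    query = mlp(word_vector, weight, bias)
    raw = [
        sum(q * k for q, k in zip(query, mlp(neg, weight, bias)))
        for neg in error_word_vectors
    ]
    # Assumed normalization: softmax over the M error word vectors.
    shift = max(raw)
    exps = [math.exp(r - shift) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]

weight = [[1.0, 0.0], [0.0, 1.0]]   # made-up W (identity), d = 2
bias = [0.0, 0.0]                   # made-up b
word_vector_i = [1.0, 0.0]
error_word_vectors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # M = 3
scores_i = attention_scores(word_vector_i, error_word_vectors, weight, bias)
```

In this toy input the first error word vector is identical to the word vector to be analyzed, so it receives the largest score, which mirrors the behaviour described for similar vectors.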
Fig. 11 is a schematic diagram of calculating an attention score according to an embodiment of the present application. It can be seen that the server or the terminal device calculates the 1st attention score 1106 from the 1st error word vector 1101 and the i-th word vector to be analyzed, the 2nd attention score 1107 from the 2nd error word vector 1102 and the i-th word vector to be analyzed, the 3rd attention score 1108 from the 3rd error word vector 1103 and the i-th word vector to be analyzed, and the 4th attention score 1109 from the 4th error word vector 1104 and the i-th word vector to be analyzed. The 1st error word vector 1101 and the i-th word vector to be analyzed have a relatively similar logical and structural relationship, so the 1st attention score 1106 is relatively high; the 2nd error word vector 1102 and the i-th word vector to be analyzed have a very similar logical and structural relationship, so the 2nd attention score 1107 is the highest.
Fig. 12 is a schematic diagram of calculating an attention score according to an embodiment of the present application. It can be seen that the server or the terminal device calculates the 1st attention score 1206 from the 1st error word vector 1201 and the j-th word vector to be analyzed, the 2nd attention score 1207 from the 2nd error word vector 1202 and the j-th word vector to be analyzed, the 3rd attention score 1208 from the 3rd error word vector 1203 and the j-th word vector to be analyzed, and the 4th attention score 1209 from the 4th error word vector 1204 and the j-th word vector to be analyzed. The 3rd error word vector 1203 and the j-th word vector to be analyzed have a very similar logical and structural relationship, so the 3rd attention score 1208 is the highest.
Therefore, the server can calculate the attention score from the word vector to be analyzed and the error word vector, and the attention score represents the degree of correlation between the two. The server or the terminal device then calculates the attention word vector (the attention mechanism vector) by weighting the error word vectors with the attention scores, and finally splices the attention word vector with the other parts, such as the word vector to be analyzed, to obtain the combined word vector.
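The attention step described above can be sketched as follows. The `mlp` function follows the formula MLP(x) = max(0, Wx + b) given in the text. Everything else is an assumption made for illustration: the score is taken to be the dot product of the two MLP outputs (the embodiment allows several ways of computing the score), the scores are normalized with a softmax before weighting, and the names `attention_scores` and `combined_word_vector` as well as the toy values are hypothetical.

```python
import math

def mlp(x, W, b):
    """One-layer MLP from the text: max(0, Wx + b), applied element-wise."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def attention_scores(word_vec, error_vecs, W, b):
    """Assumed score: dot product of the MLP outputs, one score per error vector."""
    q = mlp(word_vec, W, b)
    return [sum(qi * ki for qi, ki in zip(q, mlp(e, W, b))) for e in error_vecs]

def softmax(scores):
    """Normalize scores into weights that sum to 1 (assumed normalization)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combined_word_vector(word_vec, error_vecs, W, b):
    """Splice the word vector with the attention-weighted sum of error vectors."""
    weights = softmax(attention_scores(word_vec, error_vecs, W, b))
    attn = [sum(w * e[k] for w, e in zip(weights, error_vecs))
            for k in range(len(word_vec))]
    return word_vec + attn

# Toy data: d = 2, identity weight matrix, zero bias (hypothetical values).
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
word = [1.0, 0.0]
errors = [[1.0, 0.0], [0.0, 1.0]]
scores = attention_scores(word, errors, W, b)
combined = combined_word_vector(word, errors, W, b)
```

In this toy case the first error vector is identical to the word vector, so it receives the higher score and dominates the attention part of the combined vector.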
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 5 or fig. 9, an embodiment of the present application further provides an optional embodiment of a method for generating a code vector, where obtaining N word vectors to be analyzed corresponding to a code text to be analyzed and an error word vector corresponding to an error code text includes:
acquiring a code text to be analyzed;
converting the code text to be analyzed into a mark sequence to be analyzed, wherein the mark sequence to be analyzed is formed by converting each word or symbol in the code text to be analyzed;
generating N word vectors to be analyzed through a word vector tool according to the tag sequence to be analyzed;
acquiring a set error code text;
converting the error code text into an error mark sequence, wherein the error mark sequence is formed by converting each word or symbol in the error code text;
generating an error word vector by a word vector tool according to the error marking sequence.
In the embodiment of the application, after the server or the terminal equipment can acquire the code text to be analyzed, the code text to be analyzed is converted into the mark sequence to be analyzed, and the mark sequence to be analyzed is formed by converting each word or symbol in the code text to be analyzed. The tag sequence is also referred to as a token sequence, which is a code segment having a type that can determine a semantic representation (e.g., a keyword, a string, or a comment) of text, can be obtained using a conventional lexical analyzer pygments, or can be obtained using a modified lexical analyzer pygments, and is not limited in this regard.
FIG. 13 is a schematic diagram of converting a code text to be analyzed into a markup sequence according to an embodiment of the present application. It can be seen that the left side of fig. 11 is the code text to be analyzed, the right side is the tag sequence to be analyzed, and after the server or the terminal device obtains the code text to be analyzed, the code text is converted into the tag sequence to be analyzed.
It can be understood that the code text to be analyzed has a mapping relationship with the tag sequence to be analyzed, as in fig. 11, if the first word or symbol of the code text to be analyzed is "$type", then it may be converted into the 1 st tag sequence to be analyzed, "Variable assignment", that is, a mapping relationship between "$type" and "Variable assignment", through which the server or the terminal device may read "Variable assignment" according to "$type" or read "$type" according to "Variable assignment".
It will be appreciated that the tag to be analysed in the tag sequence to be analysed may be repeated, for example the tag "Variable assignment" in figure 11 may be repeated a plurality of times. In particular, the conversion of the code text to the tag sequence may be performed in a manner similar to table 3. Table 3 is a data table for converting a code text and a tag sequence in the embodiment of the present application, and it can be seen that when a server or a terminal device detects a text similar to $a in the code text in the conversion process, the text is converted into a tag "Variable assignment", so if multiple texts similar to $a appear in the text, there may be multiple tags "Variable assignment".
Table 3
Token type | Code text
Variable assign | $a, x
Operator | ->, !=, +, -
Keyword | for, in, while, return, continue
String | “this is an example”
Comment | //must be negative
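The conversion from code text to a mark sequence can be illustrated with a small regular-expression lexer. This is only a sketch: the embodiment uses the lexical analyzer Pygments, whereas the code below is a simplified stand-in built on the standard library. The token type names follow table 3, while the regular expressions, `TOKEN_SPEC`, and `to_mark_sequence` are hypothetical.

```python
import re

# Token types follow table 3; the regular expressions are illustrative assumptions.
TOKEN_SPEC = [
    ("Comment", r"//[^\n]*"),
    ("String", r'"[^"\n]*"'),
    ("Keyword", r"\b(?:for|in|while|return|continue)\b"),
    ("Variable assign", r"\$?[A-Za-z_]\w*"),
    ("Operator", r"->|!=|[+\-]"),
]
# Group names cannot contain spaces, so each type is addressed by its index.
PATTERN = re.compile(
    "|".join(f"(?P<T{i}>{rx})" for i, (_, rx) in enumerate(TOKEN_SPEC)))

def to_mark_sequence(code_text):
    """Return (token type, code text) pairs in source order."""
    marks = []
    for m in PATTERN.finditer(code_text):
        index = int(m.lastgroup[1:])
        marks.append((TOKEN_SPEC[index][0], m.group()))
    return marks

marks = to_mark_sequence("while $a != x // must be negative")
mark_types = [token_type for token_type, _ in marks]
```

Keeping the original text next to each mark preserves the mapping between the code text and the mark sequence, and repeated marks such as "Variable assign" appear once per occurrence.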
It can be appreciated that after the server or the terminal device converts the code text to be analyzed into the mark sequence to be analyzed, the mark sequence can be converted into word vectors to be analyzed by a word vector tool (the word2vec tool). The marks to be analyzed and the word vectors to be analyzed have a mapping relationship: the 1st mark to be analyzed corresponds to the 1st word vector to be analyzed, the 2nd mark to be analyzed corresponds to the 2nd word vector to be analyzed, ..., and the N-th mark to be analyzed corresponds to the N-th word vector to be analyzed.
In the embodiment of the application, after acquiring the error code text, the server or the terminal device converts it into the error mark sequence, which is formed by converting each word or symbol in the error code text. As with the code text to be analyzed, the error mark sequence is a token sequence that can be obtained with the conventional lexical analyzer Pygments or with a modified version of Pygments, which is not limited here. The error code text may be a code text stored in the server in advance.
It will be appreciated that, similar to the code text to be analyzed, the error code text has a mapping relationship with the error marker sequence.
It will be appreciated that similar to the code text to be analyzed, the false marks in the sequence of false marks may be repeated.
It will be appreciated that after the server or the terminal device converts the error code text into an error mark sequence, the error mark sequence may be converted into error word vectors by a word vector tool (the word2vec tool). There is a mapping relationship between the error marks and the error word vectors: the 1st error mark corresponds to the 1st error word vector, the 2nd error mark corresponds to the 2nd error word vector, ..., and the n-th error mark corresponds to the n-th error word vector.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 5 or fig. 9, an embodiment of the present application further provides an optional embodiment of a method for generating a code vector, where the combined word vector further includes a matching identifier, and the matching identifier includes a first matching identifier and a second matching identifier, where the first matching identifier is used to indicate that a to-be-analyzed tag corresponding to the combined word vector matches an error tag in the error tag sequence, and the second matching identifier is used to indicate that the to-be-analyzed tag corresponding to the combined word vector does not match the error tag in the error tag sequence.
In the embodiment of the application, when the server or the terminal device generates the combined word vector, a matching identifier can be spliced into the combined word vector. The matching identifier is either a first matching identifier or a second matching identifier. The first matching identifier indicates that the to-be-analyzed tag corresponding to the combined word vector matches an error tag in the error tag sequence, and the second matching identifier indicates that it does not match any error tag in the error tag sequence. The first matching identifier may be the value 1 and the second matching identifier may be the value 0, so that the server and the terminal device can identify the matching identifier quickly.
Table 4 is an example of a to-be-analyzed tag sequence and an error tag sequence for which matching identifiers are spliced into the combined word vectors in the embodiment of the present application. If the to-be-analyzed tag corresponding to the combined word vector differs from every error tag in the error tag sequence, the server or the terminal device splices the second matching identifier into the combined word vector. For example, when the server or the terminal device calculates the combined word vector corresponding to the to-be-analyzed tag "case", that tag differs from each of the error tags "type", "Barek" and "+", so the second matching identifier is spliced into the combined word vector corresponding to "case".
Conversely, if the to-be-analyzed tag corresponding to the combined word vector matches an error tag in the error tag sequence, the server or the terminal device splices the first matching identifier into the combined word vector. For example, when the server or the terminal device calculates the combined word vector corresponding to the to-be-analyzed tag "+", that tag matches the error tag "+", so the first matching identifier is spliced into the combined word vector corresponding to "+".
Table 4
Tag sequence to be analyzed | Error tag sequence
case | type
if | Barek
+ | +
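The matching identifier can be computed with a simple membership test, sketched below on the tags from table 4. The values 1 and 0 are the ones suggested in the text; the function name `matching_identifiers` is hypothetical, and exact string equality is assumed as the matching rule.

```python
def matching_identifiers(tags_to_analyze, error_tags):
    """Return 1 (first matching identifier) when a tag also occurs in the
    error tag sequence, otherwise 0 (second matching identifier)."""
    error_set = set(error_tags)
    return [1 if tag in error_set else 0 for tag in tags_to_analyze]

# Tags taken from table 4.
flags = matching_identifiers(["case", "if", "+"], ["type", "Barek", "+"])

# Splicing a flag onto a (hypothetical) combined word vector for the tag "+":
combined_plus = [0.2, 0.7] + [flags[2]]
```

Only "+" occurs in both sequences, so it alone receives the first matching identifier.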
Optionally, on the basis of the respective embodiments corresponding to fig. 5 or fig. 9, an embodiment of the present application further provides an optional embodiment of a method for generating a code vector, where the combined word vector further includes a duty ratio analysis tag, which identifies the proportion (duty ratio) that the corresponding tag to be analyzed occupies in the tag sequence to be analyzed.
In the embodiment of the application, when the server or the terminal device generates the combined word vector, the duty ratio analysis tag can be spliced into the combined word vector. Table 5 is a table of a tag sequence to be analyzed in the embodiment of the present application. When the server or the terminal device generates the combined word vector, it splices in the duty ratio analysis tag according to the proportion that the corresponding tag to be analyzed occupies in the tag sequence currently being processed. For example, when generating the combined word vector corresponding to the tag "if", the server or the terminal device detects that the tag "if" accounts for 3/5 of all the tags in the currently processed tag sequence to be analyzed, and splices 3/5 into the combined word vector as the duty ratio analysis tag.
TABLE 5
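The duty ratio analysis tag described above is the relative frequency of a tag within its sequence, which can be sketched as follows. The five-tag sequence is a hypothetical example chosen so that "if" accounts for 3/5 of the tags, as in the text; `proportion_tags` is an assumed name.

```python
from collections import Counter
from fractions import Fraction

def proportion_tags(tags_to_analyze):
    """For each tag, return the share of the sequence occupied by that tag."""
    counts = Counter(tags_to_analyze)
    total = len(tags_to_analyze)
    return [Fraction(counts[tag], total) for tag in tags_to_analyze]

# Hypothetical sequence in which "if" occupies 3 of 5 positions.
ratios = proportion_tags(["if", "x", "if", "+", "if"])
```

`Fraction` keeps the value exact for the illustration; in practice a floating-point value such as 0.6 would presumably be spliced into the vector.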
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 5 or fig. 9, an embodiment of the present application further provides an optional embodiment of a method for generating a code vector, where determining, by a neural network model, an output vector of the neural network model according to the combined word vector includes:
acquiring a first word vector sequence formed by arranging combination word vectors in a positive sequence;
acquiring a second word vector sequence formed by arranging the combined word vectors in a reverse order;
obtaining, through a bidirectional long short-term memory (LSTM) network model, an output vector sequence corresponding to the first word vector sequence and the second word vector sequence, wherein the bidirectional LSTM network model comprises a forward LSTM network model and a reverse LSTM network model, the forward LSTM network model is used for generating the first output sequence corresponding to the first word vector sequence, the reverse LSTM network model is used for generating the second output sequence corresponding to the second word vector sequence, the output vector sequence is formed by splicing the first output sequence and the second output sequence, and the output vector sequence is formed by arranging output vectors.
In the embodiment of the present application, the server or the terminal device determines the output vectors of the neural network model from the combined word vectors. The server or the terminal device first obtains a first word vector sequence formed by arranging the combined word vectors in positive order, where the first word vector sequence may be:
A first word vector sequence= [ 1 st combined word vector, 2 nd combined word vector … nth combined word vector ];
the server or the terminal device obtains a second word vector sequence formed by arranging the combined word vectors in an inverted order, and the second word vector sequence may be:
second word vector sequence= [ nth combined word vector, nth-1 combined word vector … 1 st combined word vector ];
fig. 14 is a schematic diagram of a server or a terminal device inputting a first word vector sequence into a forward LSTM network model, and it can be seen that after the server inputs the first word vector sequence into the forward LSTM network model, a first output sequence may be obtained, where the first output sequence is:
first output sequence= [ 1 st forward output vector, 2 nd forward output vector, … nth forward output vector ];
the 1 st forward output vector is calculated by the server according to the 1 st word vector through a forward LSTM network model, the 2 nd forward output vector is calculated by the server according to the 2 nd word vector through a forward LSTM network model, and the … nth forward output vector is calculated by the server according to the nth word vector through a forward LSTM network model.
Fig. 15 is a schematic diagram of a server or a terminal device inputting a second word vector sequence into a reverse LSTM network model, and it can be seen that after the server inputs the second word vector sequence into the reverse LSTM network model, a second output sequence may be obtained, where the second output sequence is:
Second output sequence= [ 1 st reverse output vector, 2 nd reverse output vector, … nth reverse output vector ];
the 1 st reverse output vector is calculated by the server according to the 1 st word vector through a reverse LSTM network model, the 2 nd reverse output vector is calculated by the server according to the 2 nd word vector through a reverse LSTM network model, and the … nth reverse output vector is calculated by the server according to the nth word vector through a reverse LSTM network model.
In the embodiment of the application, the neuron structure of the reverse LSTM network model is the same as that of the forward LSTM network model, but its state is transferred, and its data is input, in the direction opposite to the forward one.
After the server inputs the first word vector sequence and the second word vector sequence into the bidirectional LSTM network model, a first output sequence and a second output sequence are obtained, and the server can then splice the first output sequence and the second output sequence together to obtain the output vector sequence. The output vector sequence is formed by arranging output vectors, and each output vector is formed by splicing a forward output vector and the corresponding reverse output vector. The 1st output vector is formed by splicing the 1st forward output vector and the 1st reverse output vector, the 2nd output vector is formed by splicing the 2nd forward output vector and the 2nd reverse output vector, ..., and the N-th output vector is formed by splicing the N-th forward output vector and the N-th reverse output vector. Finally, the server obtains the output vector sequence by splicing:
Output vector sequence= [ 1 st output vector, 2 nd output vector … nth output vector ].
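The ordering and splicing logic of the bidirectional model can be sketched as below. The recurrent cell is deliberately a toy: `toy_cell` is a one-dimensional tanh recurrence standing in for a real LSTM cell, which is an assumption made to keep the example short. The point of the sketch is only how the two directions are run and how the k-th forward output is spliced with the k-th reverse output.

```python
import math

def toy_cell(state, x):
    """Stand-in for an LSTM cell (assumption): a tanh recurrence on a scalar state."""
    return math.tanh(0.5 * state + sum(x))

def run_direction(vectors):
    """Feed vectors in the given order and collect one output per step."""
    state, outputs = 0.0, []
    for x in vectors:
        state = toy_cell(state, x)
        outputs.append([state])
    return outputs

def bidirectional_outputs(combined_vectors):
    first_sequence = list(combined_vectors)              # positive order
    second_sequence = list(reversed(combined_vectors))   # reverse order
    forward = run_direction(first_sequence)
    # Re-align the reverse outputs so that index k belongs to the k-th word vector.
    backward = list(reversed(run_direction(second_sequence)))
    # Splice the k-th forward output with the k-th reverse output.
    return [f + r for f, r in zip(forward, backward)]

outputs = bidirectional_outputs([[0.1, 0.2], [0.3, 0.1], [0.0, 0.5]])
```

Each output vector therefore carries context from the words before it (forward part) and from the words after it (reverse part).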
Based on the embodiments corresponding to fig. 5 or fig. 9, fig. 16 is a schematic diagram of an application of the code analysis method according to the embodiment of the present application, where the method is applied to a server. It can be seen that the server first obtains an error code block, which is also called an error code text, and determines whether the error code block is already a mark sequence. If yes, the error code block word vector is generated from the error mark sequence; if not, the error code block is first converted into a mark sequence and the error code block word vector is then generated.
The server can also acquire a source code text, which can also be called a code text to be analyzed, and then determines whether the source code text is already a mark sequence. If yes, the source code word vector, which can also be called a word vector to be analyzed, is generated from the source code text; if not, the server first converts the source code text into a mark sequence and then generates the source code word vector from the mark sequence.
It will be appreciated that there is no required order in which the server acquires the error code block and the source code, but the error code block is typically acquired first.
After the server generates the error code block word vector and the source code word vector, step one may be performed, i.e., generating a combined word vector according to the error code block word vector and the source code word vector through an attention mechanism. The combined word vector is not shown in fig. 16, but the server inputs the combined word vector into a bi-directional LSTM network, i.e., a forward LSTM network and a reverse LSTM network, after generating the combined word vector.
Step one in fig. 16 is similar to step 502 in the respective embodiments corresponding to fig. 5, and detailed descriptions thereof are omitted here.
The server inputs the combined word vector into the bidirectional LSTM network to obtain a spliced output vector. In addition, the server converts the error code block word vector into an error code block vector. Then, using the error code block vector and the spliced output vector, the server predicts the start position and the end position of the error code according to the attention mechanism, which is step two.
In this application, step two is similar to step 503 and step 504 in each embodiment corresponding to fig. 5, and detailed descriptions thereof are omitted here.
It will be appreciated that after the server obtains the start and end positions of the error code, the connected terminal device may be instructed to highlight the error code, as shown in fig. 3.
Fig. 17 is a schematic diagram of an application example of the code analysis method provided by the embodiment of the present application. It can be seen that the server obtains an error code block, a source code text 1 and a source code text 2, and converts each of them into a mark sequence. The server then performs algorithm prediction and calculates the target code segment of the source code text 1 and the target code segment of the source code text 2. Finally, the server highlights the source code text 1 and the source code text 2 according to their respective target code segments, and highlights the source mark sequence 1 and the source mark sequence 2 correspondingly.
In the foregoing embodiments or application examples, the preset parameters may be obtained through training. The server first initializes each parameter randomly, then trains and optimizes each parameter in the method flow provided by the embodiment of the present application using the error code text and the training code text, and finally obtains the trained parameters. The training method is described below.
The server first acquires the error code text and the training code text, which may be code texts input in advance by an administrator of the developer platform; the administrator searches a plurality of code texts for error code texts corresponding to the error type to be trained. Typically, the error code text is only a fragment of code and is not complete, so the administrator needs to determine the whole, relatively complete piece of code in which the error code text is located, that is, the training code text. The administrator typically also needs to mark the start position and the end position of the error code text in the training code text. The administrator thus inputs to the server the error code text, the training code text, and the identifiers of the start position and the end position of the error code text in the training code text, and the server obtains all three.
The server then trains the parameters in the overall method flow, with the task of allowing the network to learn the optimal parameters by minimizing the objective function. The objective function is:
$$L = -\sum \log P^{(\mathrm{start})}(a_{\mathrm{start}}) - \sum \log P^{(\mathrm{end})}(a_{\mathrm{end}})$$
wherein $a_{\mathrm{start}}$ represents the start position of the error code text in the training code text and $a_{\mathrm{end}}$ represents the end position of the error code text in the training code text; each can be expressed in terms of the corresponding mark sequence. For example, if the mark sequence of the training code is A, B, C, D, E and the mark sequence of the error code is B, C, D, then the start position is 2 and the end position is 4.
The server uses a stochastic gradient descent algorithm to repeatedly compare the start and end positions of the error code predicted by the algorithm with the correct start and end positions ($a_{\mathrm{start}}$ and $a_{\mathrm{end}}$), and optimizes all the unknown parameters of the models mentioned in the algorithm flow, continuously changing their values so as to minimize the objective function. Finally, the accuracy of the algorithm model's predictions on the training set is the highest, and model training is complete.
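As a small worked example of the objective function, the sketch below evaluates the loss contribution of a single training sample, using the A, B, C, D, E example from the text (start position 2, end position 4). The probability values are hypothetical model outputs, and `position_loss` is an assumed name; the full objective sums this quantity over all training samples.

```python
import math

def position_loss(start_probs, end_probs, a_start, a_end):
    """Negative log-likelihood of the labeled start and end positions (1-based)."""
    return -math.log(start_probs[a_start - 1]) - math.log(end_probs[a_end - 1])

# Training marks A, B, C, D, E with error marks B, C, D: start = 2, end = 4.
start_probs = [0.1, 0.6, 0.1, 0.1, 0.1]   # hypothetical predicted start probabilities
end_probs = [0.1, 0.1, 0.1, 0.5, 0.2]     # hypothetical predicted end probabilities
loss = position_loss(start_probs, end_probs, a_start=2, a_end=4)
# loss = -log(0.6) - log(0.5) = -log(0.3), approximately 1.204
```

The loss approaches zero as the model assigns probability close to 1 to the labeled start and end positions, which is what gradient descent drives toward.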
Fig. 18 is a diagram showing an interface during training of a server, and it can be seen that the server obtains each parameter in the whole trained method flow after multiple rounds of training optimization.
Fig. 19 is a schematic diagram of a device for code analysis according to an embodiment of the present application, where the device 1900 for code analysis according to an embodiment of the present application includes:
an obtaining unit 1901, configured to obtain N word vectors to be analyzed corresponding to a code text to be analyzed and an error word vector corresponding to an error code text, where the error code text represents a code text that matches the code text to be analyzed, and N is an integer greater than 1;
the processing unit 1902 is configured to obtain N output vectors corresponding to a combined word vector through a neural network model, where the combined word vector is generated according to a word vector to be analyzed and an error word vector;
the processing unit 1902 is further configured to calculate, according to the N output vectors and the error word vector, a start probability of each of the N output vectors in the code text to be analyzed, and a termination probability of each of the N output vectors in the code text to be analyzed;
the processing unit 1902 is further configured to determine an object code segment according to the start probability of each output vector in the code text to be analyzed and the end probability of each output vector in the code text to be analyzed;
a generating unit 1903, configured to generate a code analysis result of the code text to be analyzed according to the target code segment.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 19, an embodiment of the present application further provides an optional embodiment of an apparatus for generating a code vector, where the processing unit 1902 is further configured to:
calculating the initial probability of each output vector in N output vectors in the code text to be analyzed according to the N output vectors and the error word vector;
calculating the termination probability of each output vector in the N output vectors in the code text to be analyzed according to the N output vectors and the error word vector;
according to the N output vectors and the error word vector, calculating the initial probability of each output vector in the N output vectors in the code text to be analyzed comprises the following steps:
determining a starting weight score corresponding to the ith output vector according to the ith output vector, the set starting weight and the error word vector, wherein i is an integer which is greater than or equal to 1 and less than or equal to N;
determining a total score of the initial weights according to the N output vectors, the set initial weights and the error word vectors;
determining a starting probability according to a starting weight score and a starting weight total score corresponding to the ith output vector;
according to the N output vectors and the error word vector, calculating the termination probability of each output vector in the N output vectors in the code text to be analyzed, wherein the method comprises the following steps:
determining a termination weight score corresponding to the j-th output vector according to the j-th output vector, the set termination weight and the error word vector;
determining total scores of the termination weights according to the N output vectors, the set termination weights and the error word vectors;
and determining termination probability according to the termination weight fraction and the termination weight total fraction corresponding to the j-th output vector, wherein j is an integer which is greater than or equal to 1 and less than or equal to N.
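The probability steps above (a weight score per output vector, a total score over all N output vectors, and their ratio) can be sketched as follows. The concrete score function is an assumption: each score is taken to be exp(o^T W e) for output vector o, set weight W and error word vector e, so that the ratio is a softmax. The names `weight_scores` and `position_probabilities` and all numeric values are hypothetical.

```python
import math

def weight_scores(output_vectors, weight, error_vec):
    """Assumed weight score: exp(o^T W e) for each output vector o."""
    we = [sum(w * e for w, e in zip(row, error_vec)) for row in weight]
    return [math.exp(sum(o_k * we_k for o_k, we_k in zip(o, we)))
            for o in output_vectors]

def position_probabilities(output_vectors, weight, error_vec):
    """Probability of each position = its weight score / total weight score."""
    scores = weight_scores(output_vectors, weight, error_vec)
    total = sum(scores)
    return [s / total for s in scores]

outputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # N = 3 output vectors
error_vec = [1.0, 0.0]
W_start = [[2.0, 0.0], [0.0, 1.0]]   # hypothetical set starting weight
W_end = [[0.0, 1.0], [2.0, 0.0]]     # hypothetical set termination weight
start_probs = position_probabilities(outputs, W_start, error_vec)
end_probs = position_probabilities(outputs, W_end, error_vec)
```

The same routine serves both cases; only the set weight differs between the start probability and the termination probability.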
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 19, an embodiment of the present application further provides an optional embodiment of an apparatus for generating a code vector, where the processing unit 1902 is further configured to:
determining the starting position of the target code segment according to the starting probability of each output vector in the code text to be analyzed;
determining the termination position of the target code segment according to the termination probability of each output vector in the code text to be analyzed;
determining the target code segment according to the starting position of the target code segment and the ending position of the target code segment;
wherein determining the starting position of the target code segment according to the starting probability of each output vector in the code text to be analyzed comprises:
obtaining an output vector with highest initial probability;
determining the initial position of the target code segment according to the output vector with the highest initial probability and the mapping relation between the output vector and the code text to be analyzed;
wherein determining the termination position of the target code segment according to the termination probability of each output vector in the code text to be analyzed comprises:
obtaining an output vector with highest termination probability;
determining the end position of the target code segment according to the output vector with the highest termination probability and the mapping relation between the output vector and the code text to be analyzed;
the target code segment is determined based on the start position and the end position.
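Selecting the target code segment from the two probability lists amounts to taking the position with the highest start probability and the position with the highest termination probability. The sketch below assumes a one-to-one mapping between output vectors and tokens of the code text to be analyzed; `target_code_segment` and the probability values are hypothetical.

```python
def target_code_segment(tokens, start_probs, end_probs):
    """Return the 0-based start and end indices with the highest probabilities,
    together with the tokens between them (inclusive)."""
    start = max(range(len(start_probs)), key=start_probs.__getitem__)
    end = max(range(len(end_probs)), key=end_probs.__getitem__)
    return start, end, tokens[start:end + 1]

tokens = ["A", "B", "C", "D", "E"]
start, end, segment = target_code_segment(
    tokens,
    [0.1, 0.6, 0.1, 0.1, 0.1],   # hypothetical start probabilities
    [0.1, 0.1, 0.1, 0.5, 0.2],   # hypothetical termination probabilities
)
```

Note that this simple version does not guard against an end index smaller than the start index, in which case the returned segment is empty.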
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 19, an embodiment of the present application further provides an optional embodiment of an apparatus for generating a code vector, where the processing unit 1902 is further configured to:
determining a combined word vector corresponding to the word vector to be analyzed according to the word vector to be analyzed and the error word vector, wherein the combined word vector comprises an attention mechanism vector and the word vector to be analyzed, the attention mechanism vector is obtained by weighting according to attention scores of the error word vector and the word vector to be analyzed, and the attention scores are used for representing the correlation degree of the error word vector and the word vector to be analyzed;
obtaining, through the neural network model, N output vectors corresponding to the combined word vectors.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 19, an embodiment of the present application further provides an optional embodiment of an apparatus for generating a code vector, where the processing unit 1902 is further configured to:
acquiring a code text to be analyzed;
converting the code text to be analyzed into a mark sequence to be analyzed, wherein the mark sequence to be analyzed is formed by converting each word or symbol in the code text to be analyzed;
generating N word vectors to be analyzed through a word vector tool according to the tag sequence to be analyzed;
acquiring a set error code text;
converting the error code text into an error mark sequence, wherein the error mark sequence is formed by converting each word or symbol in the error code text;
generating an error word vector by a word vector tool according to the error marking sequence.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 19, an embodiment of the present application further provides an optional embodiment of a device for generating a code vector, where the combined word vector further includes a matching identifier, and the matching identifier includes a first matching identifier and a second matching identifier, where the first matching identifier is used to indicate that a to-be-analyzed tag corresponding to the combined word vector matches an error tag in the error tag sequence, and the second matching identifier is used to indicate that the to-be-analyzed tag corresponding to the combined word vector does not match an error tag in the error tag sequence.
Optionally, on the basis of the respective embodiments corresponding to Fig. 19, an embodiment of the present application further provides an optional embodiment of a device for generating a code vector, where the combined word vector further includes a proportion of the tag to be analyzed, which identifies the proportion that the tag to be analyzed occupies in the tag sequence to be analyzed.
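The two additional features, the matching identifier and the proportion feature (how large a share of the tag sequence to be analyzed a tag accounts for), could be computed along the following lines. Encoding the first and second matching identifiers as 1.0 and 0.0, and treating a match as exact string equality, are assumptions for illustration; `extra_features` is a hypothetical name.

```python
from collections import Counter

def extra_features(tags, error_tags):
    error_set = set(error_tags)
    counts = Counter(tags)
    total = len(tags)
    features = []
    for tag in tags:
        # Assumption: 1.0 encodes the first matching identifier (match),
        # 0.0 encodes the second matching identifier (no match).
        match_flag = 1.0 if tag in error_set else 0.0
        # Share of the tag sequence to be analyzed taken up by this tag.
        proportion = counts[tag] / total
        features.append([match_flag, proportion])
    return features
```

Each feature pair would then be appended to the corresponding combined word vector.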
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 19, an embodiment of the present application further provides an optional embodiment of an apparatus for generating a code vector, where the processing unit 1902 is further configured to:
acquiring a first word vector sequence formed by arranging the combined word vectors in forward order;
acquiring a second word vector sequence formed by arranging the combined word vectors in reverse order;
obtaining, through a bidirectional long short-term memory (LSTM) network model, an output vector sequence corresponding to the first word vector sequence and the second word vector sequence, wherein the bidirectional LSTM network model comprises a forward LSTM network model and a reverse LSTM network model, the forward LSTM network model is used to generate a first output sequence corresponding to the first word vector sequence, the reverse LSTM network model is used to generate a second output sequence corresponding to the second word vector sequence, the output vector sequence is formed by splicing the first output sequence and the second output sequence, and the output vector sequence is an arrangement of the output vectors.
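The forward and reverse passes and the splicing step can be sketched as follows. To keep the example self-contained, a toy recurrent cell (a decayed running mix of the inputs) is swapped in for a real LSTM cell, so this shows only the data flow, not the LSTM gating. Realigning the reverse outputs to the original positions before splicing is the usual bidirectional convention and is assumed here; the function names are hypothetical.

```python
def run_cell(sequence, decay=0.5):
    # Toy recurrent pass standing in for an LSTM: each state mixes the
    # previous state with the current input vector.
    state = [0.0] * len(sequence[0])
    outputs = []
    for vec in sequence:
        state = [decay * s + (1.0 - decay) * x for s, x in zip(state, vec)]
        outputs.append(state)
    return outputs

def bidirectional_outputs(combined_vectors):
    # First output sequence: pass over the forward-ordered sequence.
    forward = run_cell(combined_vectors)
    # Second output sequence: pass over the reversed sequence, then flip
    # back so position i lines up with position i of the forward pass.
    backward = run_cell(combined_vectors[::-1])[::-1]
    # Splice the two outputs at each position into one output vector.
    return [f + b for f, b in zip(forward, backward)]
```

Each resulting output vector therefore carries context from both the tags before it and the tags after it.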
Fig. 20 is a schematic diagram of a server structure according to an embodiment of the present application. The server 2000 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 2022 (e.g., one or more processors), a memory 2032, and one or more storage media 2030 (e.g., one or more mass storage devices) storing application programs 2042 or data 2044. The memory 2032 and the storage medium 2030 may provide transitory or persistent storage. The program stored on the storage medium 2030 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Further, the central processing unit 2022 may be configured to communicate with the storage medium 2030 and to execute, on the server 2000, the series of instruction operations in the storage medium 2030.
The server 2000 may also include one or more power supplies 2026, one or more wired or wireless network interfaces 2050, one or more input/output interfaces 2058, and/or one or more operating systems 2041, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The steps performed by the server in the above embodiments may be based on the server structure shown in Fig. 20.
In the embodiment of the present application, the CPU2022 is specifically configured to perform the following steps:
acquiring N word vectors to be analyzed corresponding to code texts to be analyzed and error word vectors corresponding to error code texts, wherein the error code texts represent code texts matched with the code texts to be analyzed, and N is an integer greater than 1;
obtaining, through the neural network model, N output vectors corresponding to combined word vectors, wherein the combined word vectors are generated according to the word vectors to be analyzed and the error word vectors;
calculating the initial probability of each output vector in the N output vectors in the code text to be analyzed and the termination probability of each output vector in the code text to be analyzed according to the N output vectors and the error word vector;
determining a target code segment according to the initial probability of each output vector in the code text to be analyzed and the termination probability of each output vector in the code text to be analyzed;
generating a code analysis result of the code text to be analyzed according to the target code segment.
In an embodiment of the present application, the CPU 2022 is further configured to perform the following steps:
calculating the initial probability of each output vector in N output vectors in the code text to be analyzed according to the N output vectors and the error word vector;
calculating the termination probability of each output vector in the N output vectors in the code text to be analyzed according to the N output vectors and the error word vector;
wherein calculating the initial probability of each of the N output vectors in the code text to be analyzed according to the N output vectors and the error word vector comprises:
determining a start weight score corresponding to the i-th output vector according to the i-th output vector, a preset start weight and the error word vector, wherein i is an integer greater than or equal to 1 and less than or equal to N;
determining a start weight total score according to the N output vectors, the preset start weight and the error word vector;
determining the initial probability according to the start weight score corresponding to the i-th output vector and the start weight total score;
wherein calculating the termination probability of each of the N output vectors in the code text to be analyzed according to the N output vectors and the error word vector comprises:
determining a termination weight score corresponding to the j-th output vector according to the j-th output vector, a preset termination weight and the error word vector, wherein j is an integer greater than or equal to 1 and less than or equal to N;
determining a termination weight total score according to the N output vectors, the preset termination weight and the error word vector;
determining the termination probability according to the termination weight score corresponding to the j-th output vector and the termination weight total score.
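One way to read the weight score and total score steps is as a softmax over per-position scores, sketched below. The bilinear form (output vector times weight matrix times error word vector) and the exponential are assumptions; the text only states that each score depends on the output vector, the set weight, and the error word vector. The error side is assumed to be pooled into a single vector `error_vec`, and the function names are hypothetical.

```python
import math

def mat_vec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def position_probabilities(output_vectors, weight, error_vec):
    # Project the error word vector through the set weight matrix.
    projected = mat_vec(weight, error_vec)
    logits = [sum(h * p for h, p in zip(vec, projected)) for vec in output_vectors]
    # Subtracting the maximum keeps exp() numerically stable and does not
    # change the resulting ratios.
    shift = max(logits)
    scores = [math.exp(l - shift) for l in logits]  # one weight score per output vector
    total = sum(scores)                             # total score over all N output vectors
    return [s / total for s in scores]              # probability = score / total
```

Calling the function once with the start weight and once with the termination weight yields the two probability distributions over the N positions.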
In an embodiment of the present application, the CPU2022 is further configured to perform the following steps:
determining the starting position of the target code segment according to the starting probability of each output vector in the code text to be analyzed;
determining the termination position of the target code segment according to the termination probability of each output vector in the code text to be analyzed;
determining the target code segment according to the starting position of the target code segment and the ending position of the target code segment;
wherein determining the starting position of the target code segment according to the starting probability of each output vector in the code text to be analyzed comprises:
obtaining the output vector with the highest initial probability;
determining the starting position of the target code segment according to the output vector with the highest initial probability and the mapping relation between the output vector and the code text to be analyzed;
wherein determining the termination position of the target code segment according to the termination probability of each output vector in the code text to be analyzed comprises:
obtaining the output vector with the highest termination probability;
determining the termination position of the target code segment according to the output vector with the highest termination probability and the mapping relation between the output vector and the code text to be analyzed;
determining the target code segment according to the starting position and the termination position.
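Selecting the segment from the two probability lists amounts to two argmax operations followed by a lookup. The sketch below assumes the simplest mapping relation, namely that output vector i corresponds to tag i of the tag sequence to be analyzed; `select_target_segment` is a hypothetical name.

```python
def select_target_segment(tags, start_probs, end_probs):
    # Index of the output vector with the highest initial probability.
    start = max(range(len(start_probs)), key=start_probs.__getitem__)
    # Index of the output vector with the highest termination probability.
    end = max(range(len(end_probs)), key=end_probs.__getitem__)
    # Assumed mapping: output vector i corresponds to tag i, so the
    # segment is the inclusive tag range [start, end].
    return start, end, tags[start:end + 1]
```

Because the two positions are chosen independently, the end index can fall before the start index, in which case the slice is empty; a practical system would need a rule for that case, which the text does not specify.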
In an embodiment of the present application, the CPU2022 is further configured to perform the following steps:
determining a combined word vector corresponding to the word vector to be analyzed according to the word vector to be analyzed and the error word vector, wherein the combined word vector comprises an attention mechanism vector and the word vector to be analyzed, the attention mechanism vector is obtained by weighting according to attention scores of the error word vector and the word vector to be analyzed, and the attention scores are used for representing the correlation degree of the error word vector and the word vector to be analyzed;
obtaining, through the neural network model, the N output vectors corresponding to the combined word vectors.
In an embodiment of the present application, the CPU2022 is further configured to perform the following steps:
acquiring a code text to be analyzed;
converting the code text to be analyzed into a tag sequence to be analyzed, wherein the tag sequence to be analyzed is formed by converting each word or symbol in the code text to be analyzed;
generating the N word vectors to be analyzed from the tag sequence to be analyzed by using a word vector tool;
acquiring a preset error code text;
converting the error code text into an error tag sequence, wherein the error tag sequence is formed by converting each word or symbol in the error code text;
generating the error word vector from the error tag sequence by using the word vector tool.
The combined word vector further includes a matching identifier, the matching identifier includes a first matching identifier and a second matching identifier, the first matching identifier is used to indicate that the tag to be analyzed corresponding to the combined word vector matches an error tag in the error tag sequence, and the second matching identifier is used to indicate that the tag to be analyzed corresponding to the combined word vector does not match any error tag in the error tag sequence.
The combined word vector further includes a proportion of the tag to be analyzed, which identifies the proportion that the tag to be analyzed occupies in the tag sequence to be analyzed.
In an embodiment of the present application, the CPU2022 is further configured to perform the following steps:
acquiring a first word vector sequence formed by arranging the combined word vectors in forward order;
acquiring a second word vector sequence formed by arranging the combined word vectors in reverse order;
obtaining, through a bidirectional long short-term memory (LSTM) network model, an output vector sequence corresponding to the first word vector sequence and the second word vector sequence, wherein the bidirectional LSTM network model comprises a forward LSTM network model and a reverse LSTM network model, the forward LSTM network model is used to generate a first output sequence corresponding to the first word vector sequence, the reverse LSTM network model is used to generate a second output sequence corresponding to the second word vector sequence, the output vector sequence is formed by splicing the first output sequence and the second output sequence, and the output vector sequence is an arrangement of the output vectors.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in whole or in part, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

Claims (16)

1. A method of code analysis, comprising:
acquiring N word vectors to be analyzed corresponding to code texts to be analyzed and error word vectors corresponding to error code texts, wherein the error code texts represent code texts matched with the code texts to be analyzed, and N is an integer greater than 1;
obtaining, through a neural network model, N output vectors corresponding to combined word vectors, wherein the combined word vectors are generated according to the word vectors to be analyzed and the error word vectors;
calculating the initial probability of each output vector in the N output vectors in the code text to be analyzed and the termination probability of each output vector in the code text to be analyzed according to the N output vectors and the error word vector;
determining a target code segment according to the initial probability of each output vector in the code text to be analyzed and the termination probability of each output vector in the code text to be analyzed;
generating a code analysis result of the code text to be analyzed according to the target code segment;
the calculating, according to the N output vectors and the error word vector, an initial probability of each output vector in the N output vectors in the code text to be analyzed, and a termination probability of each output vector in the code text to be analyzed, including:
calculating the initial probability of each output vector in the N output vectors in the code text to be analyzed according to the N output vectors and the error word vector;
calculating the termination probability of each output vector in the N output vectors in the code text to be analyzed according to the N output vectors and the error word vector;
the calculating, according to the N output vectors and the error word vector, the initial probability of each output vector in the N output vectors in the code text to be analyzed includes:
determining a start weight score corresponding to the i-th output vector according to the i-th output vector, a preset start weight and the error word vector, wherein i is an integer greater than or equal to 1 and less than or equal to N;
determining a start weight total score according to the N output vectors, the preset start weight and the error word vector;
determining the initial probability according to the start weight score corresponding to the i-th output vector and the start weight total score;
the calculating, according to the N output vectors and the error word vector, the termination probability of each of the N output vectors in the code text to be analyzed includes:
determining a termination weight score corresponding to the j-th output vector according to the j-th output vector, a preset termination weight and the error word vector;
determining a termination weight total score according to the N output vectors, the preset termination weight and the error word vector;
and determining the termination probability according to the termination weight score corresponding to the j-th output vector and the termination weight total score, wherein j is an integer greater than or equal to 1 and less than or equal to N.
2. The method of claim 1, wherein said determining the target code segment based on the start probability of each output vector in the code text to be analyzed and the end probability of each output vector in the code text to be analyzed comprises:
determining the starting position of the target code segment according to the starting probability of each output vector in the code text to be analyzed;
determining the termination position of the target code segment according to the termination probability of each output vector in the code text to be analyzed;
determining the target code segment according to the starting position of the target code segment and the ending position of the target code segment;
wherein the determining the starting position of the target code segment according to the starting probability of each output vector in the code text to be analyzed comprises:
Acquiring the output vector with the highest initial probability;
determining the starting position of the target code segment according to the output vector with the highest starting probability and the mapping relation between the output vector and the code text to be analyzed;
wherein the determining the termination position of the target code segment according to the termination probability of each output vector in the code text to be analyzed comprises:
acquiring the output vector with the highest termination probability;
determining the end position of the target code segment according to the output vector with the highest termination probability and the mapping relation between the output vector and the code text to be analyzed;
and determining the target code segment according to the starting position and the ending position.
3. The method of claim 1, wherein the obtaining, by the neural network model, N output vectors corresponding to a combined word vector, wherein the combined word vector is generated according to the word vector to be analyzed and the error word vector, includes:
determining a combined word vector corresponding to the word vector to be analyzed according to the word vector to be analyzed and the error word vector, wherein the combined word vector comprises an attention mechanism vector and the word vector to be analyzed, the attention mechanism vector is obtained by weighting according to attention scores of the error word vector and the word vector to be analyzed, and the attention scores are used for representing the correlation degree of the error word vector and the word vector to be analyzed;
And acquiring the N output vectors corresponding to the combined word vector through the neural network model.
4. The method of claim 3, wherein the obtaining N word vectors to be analyzed corresponding to the code text to be analyzed and the error word vector corresponding to the error code text comprises:
acquiring the code text to be analyzed;
converting the code text to be analyzed into a tag sequence to be analyzed, wherein the tag sequence to be analyzed is formed by converting each word or symbol in the code text to be analyzed;
generating the N word vectors to be analyzed from the tag sequence to be analyzed by using a word vector tool;
acquiring the preset error code text;
converting the error code text into an error tag sequence, wherein the error tag sequence is formed by converting each word or symbol in the error code text;
and generating the error word vector from the error tag sequence by using the word vector tool.
5. The method of claim 4, wherein the combined word vector further comprises a matching identifier, the matching identifier comprising a first matching identifier for indicating that the to-be-analyzed tag corresponding to the combined word vector matches the error tag in the error tag sequence, and a second matching identifier for indicating that the to-be-analyzed tag corresponding to the combined word vector does not match the error tag in the error tag sequence.
6. The method of claim 4, wherein the combined word vector further comprises a proportion of the tag to be analyzed, which identifies the proportion that the tag to be analyzed occupies in the tag sequence to be analyzed.
7. The method of claim 3, wherein the obtaining, by the neural network model, the N output vectors corresponding to the combined word vector comprises:
acquiring a first word vector sequence formed by arranging the combined word vectors in forward order;
acquiring a second word vector sequence formed by arranging the combined word vectors in reverse order;
obtaining, through a bidirectional long short-term memory (LSTM) network model, an output vector sequence corresponding to the first word vector sequence and the second word vector sequence, wherein the bidirectional LSTM network model comprises a forward LSTM network model and a reverse LSTM network model, the forward LSTM network model is used to generate a first output sequence corresponding to the first word vector sequence, the reverse LSTM network model is used to generate a second output sequence corresponding to the second word vector sequence, the output vector sequence is formed by splicing the first output sequence and the second output sequence, and the output vector sequence is an arrangement of the output vectors.
8. An apparatus for code analysis, comprising:
the acquisition unit is used for acquiring N word vectors to be analyzed corresponding to the code text to be analyzed and error word vectors corresponding to the error code text, wherein the error code text represents the code text matched with the code text to be analyzed, and N is an integer greater than 1;
the processing unit is used for obtaining N output vectors corresponding to the combined word vectors through a neural network model, wherein the combined word vectors are generated according to the word vectors to be analyzed and the error word vectors; calculating the initial probability of each output vector in the N output vectors in the code text to be analyzed and the termination probability of each output vector in the code text to be analyzed according to the N output vectors and the error word vector;
the processing unit is further used for determining a target code segment according to the initial probability of each output vector in the code text to be analyzed and the termination probability of each output vector in the code text to be analyzed;
the generation unit is used for generating a code analysis result of the code text to be analyzed according to the target code segment;
The processing unit is specifically configured to:
calculating the initial probability of each output vector in the N output vectors in the code text to be analyzed according to the N output vectors and the error word vector;
calculating the termination probability of each output vector in the N output vectors in the code text to be analyzed according to the N output vectors and the error word vector;
the calculating, according to the N output vectors and the error word vector, the initial probability of each output vector in the N output vectors in the code text to be analyzed includes:
determining a start weight score corresponding to the i-th output vector according to the i-th output vector, a preset start weight and the error word vector, wherein i is an integer greater than or equal to 1 and less than or equal to N;
determining a start weight total score according to the N output vectors, the preset start weight and the error word vector;
determining the initial probability according to the start weight score corresponding to the i-th output vector and the start weight total score;
the calculating, according to the N output vectors and the error word vector, the termination probability of each of the N output vectors in the code text to be analyzed includes:
determining a termination weight score corresponding to the j-th output vector according to the j-th output vector, a preset termination weight and the error word vector;
determining a termination weight total score according to the N output vectors, the preset termination weight and the error word vector;
and determining the termination probability according to the termination weight score corresponding to the j-th output vector and the termination weight total score, wherein j is an integer greater than or equal to 1 and less than or equal to N.
9. The apparatus of claim 8, wherein the processing unit is further configured to:
determining the starting position of the target code segment according to the starting probability of each output vector in the code text to be analyzed;
determining the termination position of the target code segment according to the termination probability of each output vector in the code text to be analyzed;
determining the target code segment according to the starting position of the target code segment and the ending position of the target code segment;
wherein the determining the starting position of the target code segment according to the starting probability of each output vector in the code text to be analyzed comprises:
acquiring the output vector with the highest initial probability;
Determining the starting position of the target code segment according to the output vector with the highest starting probability and the mapping relation between the output vector and the code text to be analyzed;
wherein the determining the termination position of the target code segment according to the termination probability of each output vector in the code text to be analyzed comprises:
acquiring the output vector with the highest termination probability;
determining the end position of the target code segment according to the output vector with the highest termination probability and the mapping relation between the output vector and the code text to be analyzed;
and determining the target code segment according to the starting position and the ending position.
10. The apparatus of claim 8, wherein the processing unit is further configured to:
determining a combined word vector corresponding to the word vector to be analyzed according to the word vector to be analyzed and the error word vector, wherein the combined word vector comprises an attention mechanism vector and the word vector to be analyzed, the attention mechanism vector is obtained by weighting according to attention scores of the error word vector and the word vector to be analyzed, and the attention scores are used for representing the correlation degree of the error word vector and the word vector to be analyzed;
And acquiring the N output vectors corresponding to the combined word vector through the neural network model.
11. The apparatus according to claim 10, wherein the acquisition unit is specifically configured to:
acquiring the code text to be analyzed;
converting the code text to be analyzed into a tag sequence to be analyzed, wherein the tag sequence to be analyzed is formed by converting each word or symbol in the code text to be analyzed;
generating the N word vectors to be analyzed from the tag sequence to be analyzed by using a word vector tool;
acquiring the preset error code text;
converting the error code text into an error tag sequence, wherein the error tag sequence is formed by converting each word or symbol in the error code text;
and generating the error word vector from the error tag sequence by using the word vector tool.
12. The apparatus of claim 11, wherein the combined word vector further comprises a match indicator, the match indicator comprising a first match indicator for indicating that the marker to be analyzed corresponding to the combined word vector matches the error marker in the error marker sequence, and a second match indicator for indicating that the marker to be analyzed corresponding to the combined word vector does not match the error marker in the error marker sequence.
13. The apparatus of claim 11, wherein the combined word vector further comprises a proportion of the tag to be analyzed, which identifies the proportion that the tag to be analyzed occupies in the tag sequence to be analyzed.
14. The apparatus of claim 10, wherein the processing unit is further configured to:
acquiring a first word vector sequence formed by arranging the combined word vectors in forward order;
acquiring a second word vector sequence formed by arranging the combined word vectors in reverse order;
obtaining, through a bidirectional long short-term memory (LSTM) network model, an output vector sequence corresponding to the first word vector sequence and the second word vector sequence, wherein the bidirectional LSTM network model comprises a forward LSTM network model and a reverse LSTM network model, the forward LSTM network model is used to generate a first output sequence corresponding to the first word vector sequence, the reverse LSTM network model is used to generate a second output sequence corresponding to the second word vector sequence, the output vector sequence is formed by splicing the first output sequence and the second output sequence, and the output vector sequence is an arrangement of the output vectors.
15. A server, the server comprising: memory, transceiver, processor, and bus system;
wherein the memory is used for storing programs;
the processor is used for executing the program in the memory, and comprises the following steps:
acquiring N word vectors to be analyzed corresponding to code texts to be analyzed and error word vectors corresponding to error code texts, wherein the error code texts represent code texts matched with the code texts to be analyzed, and N is an integer greater than 1;
obtaining, through a neural network model, N output vectors corresponding to combined word vectors, wherein the combined word vectors are generated according to the word vectors to be analyzed and the error word vectors; calculating the initial probability of each output vector in the N output vectors in the code text to be analyzed and the termination probability of each output vector in the code text to be analyzed according to the N output vectors and the error word vector;
determining a target code segment according to the initial probability of each output vector in the code text to be analyzed and the termination probability of each output vector in the code text to be analyzed;
generating a code analysis result of the code text to be analyzed according to the target code segment;
the calculating, according to the N output vectors and the error word vector, an initial probability of each output vector in the N output vectors in the code text to be analyzed, and a termination probability of each output vector in the code text to be analyzed, including:
calculating the initial probability of each output vector in the N output vectors in the code text to be analyzed according to the N output vectors and the error word vector;
calculating the termination probability of each output vector in the N output vectors in the code text to be analyzed according to the N output vectors and the error word vector;
the calculating, according to the N output vectors and the error word vector, the initial probability of each of the N output vectors in the code text to be analyzed comprises:
determining an initial weight score corresponding to the i-th output vector according to the i-th output vector, a set initial weight and the error word vector, wherein i is an integer greater than or equal to 1 and less than or equal to N;
determining an initial weight total score according to the N output vectors, the set initial weight and the error word vector;
determining the initial probability according to the initial weight score corresponding to the i-th output vector and the initial weight total score;
the calculating, according to the N output vectors and the error word vector, the termination probability of each of the N output vectors in the code text to be analyzed comprises:
determining a termination weight score corresponding to the j-th output vector according to the j-th output vector, a set termination weight and the error word vector, wherein j is an integer greater than or equal to 1 and less than or equal to N;
determining a termination weight total score according to the N output vectors, the set termination weight and the error word vector;
determining the termination probability according to the termination weight score corresponding to the j-th output vector and the termination weight total score;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
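The probability and segment-selection steps of claim 15 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the claim does not fix the scoring function, so the weight score for position i is assumed here to be exp(h_i^T W e), where h_i is the i-th output vector, W is the set weight matrix and e is the error word vector. Dividing each score by the total score then gives a softmax over positions. Choosing the target code segment by maximizing the product of initial and termination probabilities, with the start not after the end, is likewise an assumption. The function names boundary_probabilities and best_span are hypothetical.

```python
import numpy as np

def boundary_probabilities(outputs, weight, error_vec):
    """Probability of each position being a segment boundary.

    outputs: (N, d_h) output vectors; weight: (d_h, d_e) set weight;
    error_vec: (d_e,) error word vector. Returns an array of shape (N,).
    """
    logits = outputs @ weight @ error_vec
    # Per-position weight scores. Subtracting the maximum only rescales
    # every score by the same factor, so the ratio below is unchanged.
    scores = np.exp(logits - logits.max())
    total = scores.sum()  # weight total score over all N positions
    return scores / total

def best_span(p_start, p_end):
    """Return (start, end) with start <= end maximizing p_start * p_end."""
    best, best_p = (0, 0), -1.0
    max_start, max_start_idx = -1.0, 0
    for j in range(len(p_end)):
        # Track the most probable start seen at or before position j.
        if p_start[j] > max_start:
            max_start, max_start_idx = p_start[j], j
        if max_start * p_end[j] > best_p:
            best_p = max_start * p_end[j]
            best = (max_start_idx, j)
    return best
```

In use, boundary_probabilities would be called twice, once with the set initial weight and once with the set termination weight, and the two resulting arrays passed to best_span. The returned indices delimit the tokens of the code text to be analyzed that form the target code segment.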
16. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the code analysis method of any one of claims 1-7.
CN201910747791.7A 2019-08-13 2019-08-13 Code analysis method and related device Active CN110427330B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910747791.7A CN110427330B (en) 2019-08-13 2019-08-13 Code analysis method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910747791.7A CN110427330B (en) 2019-08-13 2019-08-13 Code analysis method and related device

Publications (2)

Publication Number Publication Date
CN110427330A CN110427330A (en) 2019-11-08
CN110427330B true CN110427330B (en) 2023-09-26

Family

ID=68414522

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910747791.7A Active CN110427330B (en) 2019-08-13 2019-08-13 Code analysis method and related device

Country Status (1)

Country Link
CN (1) CN110427330B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113536743A (en) * 2020-11-06 2021-10-22 腾讯科技(深圳)有限公司 Text processing method and related device

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609009A (en) * 2017-07-26 2018-01-19 北京大学深圳研究院 Text emotion analysis method, device, storage medium and computer equipment
CN108287820A (en) * 2018-01-12 2018-07-17 北京神州泰岳软件股份有限公司 Method and device for generating text representation
CN108491208A (en) * 2018-01-31 2018-09-04 中山大学 Code annotation classification method based on neural network model
CN108804421A (en) * 2018-05-28 2018-11-13 中国科学技术信息研究所 Text similarity analysis method, device, electronic equipment and computer storage media
CN109522548A (en) * 2018-10-26 2019-03-26 天津大学 Text emotion analysis method based on two-way interactive neural network
CN109840322A (en) * 2018-11-08 2019-06-04 中山大学 Cloze-style reading comprehension analysis model and method based on reinforcement learning
CN109871317A (en) * 2019-01-11 2019-06-11 平安普惠企业管理有限公司 Code quality analysis method and device, storage medium and electronic equipment
CN110019784A (en) * 2017-09-29 2019-07-16 北京国双科技有限公司 File classification method and device
CN110032736A (en) * 2019-03-22 2019-07-19 深兰科技(上海)有限公司 Text analysis method, apparatus and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107133202A (en) * 2017-06-01 2017-09-05 北京百度网讯科技有限公司 Text verification method and device based on artificial intelligence

Also Published As

Publication number Publication date
CN110427330A (en) 2019-11-08

Similar Documents

Publication Publication Date Title
CN110008342A (en) Document classification method, apparatus, equipment and storage medium
CN112860841B (en) Text emotion analysis method, device, equipment and storage medium
CN107004163A (en) The feature design of mistake driving in machine learning
EP4113357A1 (en) Method and apparatus for recognizing entity, electronic device and storage medium
CN111666766B (en) Data processing method, device and equipment
KR101060973B1 (en) Automatic assessment of excessively repeated word usage in essays
CN111461301A (en) Serialized data processing method and device, and text processing method and device
CN110427464B (en) Code vector generation method and related device
CN111177351A (en) Method, device and system for acquiring natural language expression intention based on rule
US20230237084A1 (en) Method and apparatus for question-answering using a database consist of query vectors
CN111460810A (en) Crowd-sourced task spot check method and device, computer equipment and storage medium
CN115525750A (en) Robot phonetics detection visualization method and device, electronic equipment and storage medium
CN116756041A (en) Code defect prediction and positioning method and device, storage medium and computer equipment
CN109408175B (en) Real-time interaction method and system in general high-performance deep learning calculation engine
Venkatesh et al. Enhancing comprehension and navigation in Jupyter notebooks with static analysis
CN110427330B (en) Code analysis method and related device
CN113220854B (en) Intelligent dialogue method and device for machine reading and understanding
CN112149828B (en) Operator precision detection method and device based on deep learning framework
US11288265B2 (en) Method and apparatus for building a paraphrasing model for question-answering
CN112559711A (en) Synonymous text prompting method and device and electronic equipment
US20210165800A1 (en) Method and apparatus for question-answering using a paraphrasing model
CN110032714B (en) Corpus labeling feedback method and device
CN110782128A (en) User occupation label generation method and device and electronic equipment
CN115455922A (en) Form verification method and device, electronic equipment and storage medium
CN109284483A (en) Text handling method, device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant