CN116755807A - Page document information processing method and electronic equipment - Google Patents

Publication number
CN116755807A
Authority
CN
China
Prior art keywords
information
document
page
target page
text
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310459684.0A
Other languages
Chinese (zh)
Inventor
胡兴
高海慧
赵明祥
高鹏
孟玉峰
潘晨笑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Alibaba Overseas Internet Industry Co ltd
Original Assignee
Alibaba China Co Ltd
Application filed by Alibaba China Co Ltd
Priority to CN202310459684.0A
Publication of CN116755807A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/451 Execution arrangements for user interfaces
    • G06F9/454 Multi-language systems; Localisation; Internationalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The embodiments of the application disclose a page document information processing method and an electronic device. The method comprises: receiving current language information and document information collected from a target page and reported by a client; judging, according to the current language information, whether the received document content has a translation abnormality; determining, by counting the document information reported by the clients of a plurality of users, whether the document at each position in the target page is manageable; generating a document management work order for documents in the target page that belong to the manageable class and have a translation abnormality; and storing, in a word stock created for the target page, the content information and auxiliary information of manageable documents that have no translation abnormality in the corresponding language, and/or the auxiliary information of non-manageable documents, so that the client reports only document information that does not hit the word stock. With the embodiments of the application, translation abnormalities in page documents can be detected while the amount of data that the user terminal side needs to report is reduced.

Description

Page document information processing method and electronic equipment
Technical Field
The present application relates to the field of information processing technologies, and in particular, to a page document information processing method and an electronic device.
Background
Multilingual presentation of pages is an important part of internationalization services. For example, on an e-commerce service platform for cross-border commodity transactions, pages of the website may be browsed by users in different countries. The documents in a page therefore need to be translated into a plurality of languages in advance, so that the page can be switched between languages for users in different countries.
However, a website typically includes a very large number of pages, and different pages correspond to different development teams; a single page may also include multiple functional modules, each of which may correspond to a different development team. Whether the documents of a specific page or functional module are translated into multiple languages generally depends on the development team's awareness of multilingual presentation. That is, after a page or functional module is developed, the development team must actively interface with a system that provides translation services, obtain and store the translation results in a plurality of languages, and, when the page is to be displayed in a certain language, read the translation result in that language for display, so as to meet the multilingual access requirements of different users. If the development team forgets to translate a document, or selects too few languages to cover the commonly used ones, some documents will not be displayed in the requested language when the page is shown; this is the "missed translation" situation. In addition, because the documents in a page are usually translated automatically by the translation service system, a "mistranslation" may also occur, that is, the translation of part of the document content in a certain language may be wrong.
To address "missed translations" and "mistranslations" in pages, one approach is to scan the page code before the page is released, so that such problems do not go online. However, because there are many release and operation platforms and documents reach online pages through a variety of channels, a comprehensive inspection is difficult to achieve in this way. It is also difficult to prevent such problems at the source (for example, by imposing constraints on developers). As a result, "missed translations" and "mistranslations" occur frequently in live pages and seriously affect the user experience.
Disclosure of Invention
The application provides a page document information processing method and an electronic device, which can detect translation abnormalities in page documents and reduce the amount of data that the user terminal side needs to report.
The application provides the following scheme:
a page document information processing system, comprising:
a client, configured to execute a page script injected into a target page, so as to collect the current language information of the target page and the document information in the page while a user accesses the target page, and to report the collected document information to a server, wherein the collected document information comprises document content information and auxiliary information of the document, and the auxiliary information comprises information for uniquely locating the document in the target page;
the server, configured to judge, according to the current language information, whether the received document content has a translation abnormality; to judge whether the documents at a plurality of different positions in the target page are manageable by counting the document information about the target page in the same language reported by the clients of a plurality of users; and to store, in a word stock created for the target page, the document content information and auxiliary information of documents that belong to the manageable class and have no translation abnormality in the corresponding language, and/or the auxiliary information of documents that belong to the non-manageable class;
wherein the word stock is synchronized to the client, so that only document information that does not hit the word stock is reported when documents are collected from the target page by the page script.
A page document information processing method includes:
determining word stock information associated with a target page, wherein the word stock stores information of a plurality of entries, comprising: the content information and auxiliary information of documents in the target page that belong to the manageable class and have no translation abnormality in the target language, and/or the auxiliary information of documents that belong to the non-manageable class, the auxiliary information comprising information for uniquely locating the document in the target page;
collecting, while a user accesses the target page, the current language information of the target page and the document information in the page, wherein the collected document information comprises the content information of the document and the auxiliary information of the document;
comparing the collected document information with the entry information in the word stock, and reporting the document information that misses the word stock, together with the current language information, to a server, so that the server judges, according to the current language information, whether the received document information has a translation abnormality.
The method further comprises:
when the word stock associated with the target page has not been created or is empty, reporting the full set of document information collected on the user terminal side to the server.
The word stock is generated and updated by the server in the following way: counting the document information about the target page in the same language reported by the clients of a plurality of users; judging whether the documents at a plurality of different positions in the target page are manageable; and storing, in a word stock created for the target page, the document content information and auxiliary information of documents that belong to the manageable class and have no translation abnormality in the corresponding language, and/or the auxiliary information of documents that belong to the non-manageable class.
Collecting the current language information of the target page and the document information in the page comprises:
collecting the current language information of the target page and the document information in the page when the user terminal side has started to access the target page and no interaction has yet occurred;
and monitoring changes to the Document Object Model (DOM) associated with the target page so as to detect user interaction, and, when interaction is detected, collecting the document information on the changed nodes.
The method further comprises:
if page interaction occurs while the user terminal side is executing the document information collection and reporting task, triggering the browser to refresh the page at a preset frame rate, splitting the task into a plurality of subtasks, so that the subtasks are executed, one unit at a time, in the idle time left between the page drawing tasks of adjacent frames.
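The slicing idea can be sketched as follows. This is a minimal illustration rather than the application's implementation: `runInIdleSlices`, `IdleDeadlineLike` and `IdleScheduler` are hypothetical names, and the scheduler is injected so that the sketch also runs outside a browser. In a browser, a function such as `(cb) => requestIdleCallback(cb)` could be passed as the scheduler.

```typescript
// Minimal shape of the deadline object handed over by an idle scheduler.
// It mirrors the browser's IdleDeadline but is declared here so the sketch is self-contained.
interface IdleDeadlineLike {
  timeRemaining(): number;
}

// Hypothetical scheduler type: calls back when the page has idle time.
type IdleScheduler = (callback: (deadline: IdleDeadlineLike) => void) => void;

// Processes `items` in slices. Each slice runs only while the current idle
// period has time left; the remainder is rescheduled for the next idle period.
function runInIdleSlices<T>(
  items: T[],
  handle: (item: T) => void,
  scheduleIdle: IdleScheduler,
  onDone: () => void = () => {},
): void {
  let index = 0;
  const step = (deadline: IdleDeadlineLike): void => {
    // Handle at least one item per slice so the task cannot stall.
    do {
      handle(items[index]);
      index += 1;
    } while (index < items.length && deadline.timeRemaining() > 0);
    if (index < items.length) {
      scheduleIdle(step);
    } else {
      onDone();
    }
  };
  if (items.length === 0) {
    onDone();
  } else {
    scheduleIdle(step);
  }
}
```

Because each slice yields as soon as the idle budget is used up, frame drawing is not blocked by a long collection pass.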
The auxiliary information of the document comprises path information, in the DOM of the target page, of the page element corresponding to the document;
the method further comprises:
monitoring changes to the path information of page elements caused by page interaction, so that the path information collected for the same page element remains unique.
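One way to keep the path of an element stable is to remember the first path computed for it. The sketch below assumes this caching strategy, which the text does not prescribe; `ElementLike`, `computePath` and `stablePath` are hypothetical names, and `ElementLike` is a minimal stand-in for a DOM element so the example runs outside a browser.

```typescript
// Minimal stand-in for a DOM element (assumption: only these fields are needed).
interface ElementLike {
  tagName: string;
  parent: ElementLike | null;
  children: ElementLike[];
}

// Builds a path such as "body[0]/div[1]/span[0]" from the root down to the element.
function computePath(el: ElementLike): string {
  const segments: string[] = [];
  let node: ElementLike | null = el;
  while (node !== null) {
    const parent: ElementLike | null = node.parent;
    const index: number = parent === null ? 0 : parent.children.indexOf(node);
    segments.unshift(`${node.tagName.toLowerCase()}[${index}]`);
    node = parent;
  }
  return segments.join("/");
}

// Remembers the first path seen for each element, so that later DOM changes
// (for example a sibling inserted by an interaction) do not alter its key.
const firstSeenPath = new WeakMap<ElementLike, string>();

function stablePath(el: ElementLike): string {
  const cached = firstSeenPath.get(el);
  if (cached !== undefined) {
    return cached;
  }
  const path = computePath(el);
  firstSeenPath.set(el, path);
  return path;
}
```

A `WeakMap` is used so that cached paths do not keep removed elements alive.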
The auxiliary information of the document further comprises position coordinate information of the document in the target page, so that the server adds a corresponding frame selection mark in the target page according to the position coordinate information after recognizing the document information with abnormal translation.
Before the current language information of the target page and the document information in the page are collected, the method further comprises:
determining, by performance monitoring of the target page, the time at which element jitter in the target page ends, so that the collection of the current language information of the target page and the document information in the page is triggered after the element jitter has ended.
The method further comprises:
judging whether a collected document corresponds to a hidden element, and if so, cancelling the reporting of that document information.
Comparing the collected document information with the entry information in the word stock, and reporting the document information that misses the word stock together with the current language information to a server, comprises:
judging whether the word stock contains a target entry corresponding to the auxiliary information of the currently collected document;
if the target entry exists, its document category is the manageable class, and the content of the currently collected document is consistent with the document content in the target entry, determining that the collected document does not need to be reported; or, if the target entry exists and its document category is the non-manageable class, determining that the collected document does not need to be reported;
and reporting the remaining documents, other than those that do not need to be reported, to the server as document information that misses the word stock.
The method further comprises:
if the document category of the target entry is the manageable class and the content of the currently collected document is inconsistent with the document content in the target entry, performing language identification on the collected document content;
if the language of the collected document content is inconsistent with the current language of the target page, adding a translation abnormality mark to the document content and reporting it to the server.
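The client-side comparison against the word stock can be sketched as below. All names (`WordStockEntry`, `CollectedDoc`, `selectDocsToReport`) are hypothetical, the word stock is assumed to be keyed by the path from the auxiliary information, and language identification is passed in as a callback because the text does not specify how it is performed. Reporting a changed document without a mark when its language still matches the page is also an assumption.

```typescript
type DocCategory = "manageable" | "unmanageable";

interface WordStockEntry {
  category: DocCategory;
  // Present only for manageable entries: content known to be correctly translated.
  content?: string;
}

interface CollectedDoc {
  path: string;    // auxiliary information locating the document in the page
  content: string; // document content as displayed
}

interface ReportItem extends CollectedDoc {
  translationAnomaly: boolean;
}

// Returns only the documents that miss the word stock and so need reporting.
// With an empty word stock, every collected document is returned.
function selectDocsToReport(
  docs: CollectedDoc[],
  wordStock: Map<string, WordStockEntry>,
  pageLanguage: string,
  detectLanguage: (text: string) => string,
): ReportItem[] {
  const toReport: ReportItem[] = [];
  for (const doc of docs) {
    const entry = wordStock.get(doc.path);
    if (entry === undefined) {
      // No entry for this position yet: report it unmarked.
      toReport.push({ ...doc, translationAnomaly: false });
      continue;
    }
    if (entry.category === "unmanageable") continue; // never reported
    if (entry.content === doc.content) continue;     // already verified
    // Manageable entry whose content changed: check the language before reporting.
    const anomaly = detectLanguage(doc.content) !== pageLanguage;
    toReport.push({ ...doc, translationAnomaly: anomaly });
  }
  return toReport;
}
```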
A page document information processing method includes:
receiving current language information and document information collected from a target page and reported by a client, wherein the document information comprises document content information and document auxiliary information, and the auxiliary information comprises information for uniquely locating the document in the target page;
judging, according to the current language information, whether the received document content has a translation abnormality;
judging whether the documents at a plurality of different positions in the target page are manageable by counting the document information about the target page in the same language reported by the clients of a plurality of users;
generating a document management work order for documents in the target page that belong to the manageable class and have a translation abnormality, so that the translation abnormality can be handled;
and storing, in a word stock created for the target page, the content information and auxiliary information of documents that belong to the manageable class and have no translation abnormality in the corresponding language, and/or the auxiliary information of documents that belong to the non-manageable class, wherein the word stock is provided to the client so that the client reports only document information that misses the word stock when collecting documents from the target page.
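The routing performed in the last two steps can be sketched as follows, assuming that the manageability and anomaly judgments have already been made and are carried on each record. `JudgedDoc`, `WorkOrder`, `WordStockRecord` and `routeJudgedDocs` are hypothetical names introduced only for illustration.

```typescript
interface JudgedDoc {
  path: string;                // auxiliary information locating the document
  language: string;
  content: string;
  manageable: boolean;         // outcome of the statistical classification step
  translationAnomaly: boolean; // outcome of the translation abnormality check
}

interface WorkOrder {
  path: string;
  language: string;
  content: string;
}

interface WordStockRecord {
  path: string;
  language: string;
  category: "manageable" | "unmanageable";
  content?: string; // stored only for manageable, correctly translated documents
}

// Splits judged documents into work orders and word stock records.
function routeJudgedDocs(
  docs: JudgedDoc[],
): { workOrders: WorkOrder[]; wordStock: WordStockRecord[] } {
  const workOrders: WorkOrder[] = [];
  const wordStock: WordStockRecord[] = [];
  for (const doc of docs) {
    if (!doc.manageable) {
      // Non-manageable: keep only the auxiliary information.
      wordStock.push({ path: doc.path, language: doc.language, category: "unmanageable" });
    } else if (doc.translationAnomaly) {
      // Manageable with an abnormality: hand over for fixing.
      workOrders.push({ path: doc.path, language: doc.language, content: doc.content });
    } else {
      // Manageable and correct: the client can skip it from now on.
      wordStock.push({
        path: doc.path,
        language: doc.language,
        category: "manageable",
        content: doc.content,
      });
    }
  }
  return { workOrders, wordStock };
}
```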
Judging whether the documents at a plurality of different positions in the target page are manageable, by counting the document information about the target page in the same language reported by the clients of a plurality of users, comprises:
judging whether the document content seen by different users at the same position of the target page is the same, and determining from the result whether the document at that position is manageable.
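A minimal sketch of this statistical judgment is given below. The names (`UserReport`, `Verdict`, `classifyPositions`) are hypothetical, and the `minUsers` threshold and the "undetermined" verdict are assumptions added so that a position is not classified from too few observations; the text itself only states that content seen by different users at the same position is compared.

```typescript
interface UserReport {
  userId: string;
  path: string;     // position of the document in the page
  language: string;
  content: string;
}

type Verdict = "manageable" | "unmanageable" | "undetermined";

// Classifies each (language, path) position by whether different users saw the same content.
function classifyPositions(reports: UserReport[], minUsers = 3): Map<string, Verdict> {
  // position key -> (content -> users who saw that content)
  const byPosition = new Map<string, Map<string, Set<string>>>();
  for (const r of reports) {
    const key = `${r.language}|${r.path}`;
    let contents = byPosition.get(key);
    if (contents === undefined) {
      contents = new Map<string, Set<string>>();
      byPosition.set(key, contents);
    }
    let users = contents.get(r.content);
    if (users === undefined) {
      users = new Set<string>();
      contents.set(r.content, users);
    }
    users.add(r.userId);
  }

  const verdicts = new Map<string, Verdict>();
  byPosition.forEach((contents, key) => {
    if (contents.size > 1) {
      // Users saw different content here: treat as content that follows page data.
      verdicts.set(key, "unmanageable");
      return;
    }
    let userCount = 0;
    contents.forEach((users) => { userCount += users.size; });
    verdicts.set(key, userCount >= minUsers ? "manageable" : "undetermined");
  });
  return verdicts;
}
```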
The document information reported by the user terminal side further comprises position coordinate information of the document in the target page;
the method further comprises:
after determining that document information with a translation abnormality exists, adding frame selection mark information for that document to the target page according to the position coordinate information, so that the frame selection mark information is provided together with the document management work order.
The method further comprises:
controlling a document collection policy for the target page according to the user access volume, code iteration frequency and/or user operation complexity of the target page, so that the user terminal side executes the document collection and reporting logic according to the policy, wherein the document collection policy comprises a sampling rate for document collection.
Controlling the document collection policy of the target page according to the user access volume, code iteration frequency and/or user operation complexity of the target page comprises:
controlling the document collection policy of the target page according to the user access volume, code iteration frequency and/or user operation complexity corresponding to each stage of the target page's life cycle.
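As an illustration only, a sampling rate could be derived from page metrics as in the sketch below. The formula, the `baseSamplesPerDay` default and the names `PageMetrics`, `chooseSamplingRate` and `shouldCollect` are all assumptions; the text states only that the policy depends on access volume, iteration frequency and/or operation complexity and includes a sampling rate.

```typescript
interface PageMetrics {
  dailyVisits: number;      // user access volume
  releasesPerMonth: number; // code iteration frequency
}

// Hypothetical heuristic: aim for a roughly fixed number of sampled visits per day,
// and raise that target for pages whose code changes more often.
function chooseSamplingRate(metrics: PageMetrics, baseSamplesPerDay = 1000): number {
  const targetSamples = baseSamplesPerDay * (1 + metrics.releasesPerMonth);
  return Math.min(1, targetSamples / Math.max(1, metrics.dailyVisits));
}

// Decides on the client whether this visit runs the collection logic.
// `random` is injectable so the decision can be tested deterministically.
function shouldCollect(samplingRate: number, random: () => number = Math.random): boolean {
  return random() < samplingRate;
}
```

Under this heuristic a low-traffic page is sampled on every visit, while a high-traffic page is sampled only on a small fraction of visits.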
A page document information processing apparatus comprising:
a word stock determining unit, configured to determine word stock information associated with a target page, wherein the word stock stores information of a plurality of entries, comprising: the content information and auxiliary information of documents in the target page that belong to the manageable class and have no translation abnormality in the target language, and/or the auxiliary information of documents that belong to the non-manageable class, the auxiliary information comprising information for uniquely locating the document in the target page;
an acquisition unit, configured to collect, while a user accesses the target page, the current language information of the target page and the document information in the page, wherein the collected document information comprises the content information of the document and the auxiliary information of the document;
and a comparison and judgment unit, configured to compare the collected document information with the entry information in the word stock, and to report the document information that misses the word stock, together with the current language information, to a server, so that the server judges, according to the current language information, whether the received document information has a translation abnormality.
A page document information processing apparatus comprising:
a document information receiving unit, configured to receive current language information and document information collected from a target page and reported by a client, wherein the document information comprises document content information and document auxiliary information, and the auxiliary information comprises information for uniquely locating the document in the target page;
a translation abnormality judging unit, configured to judge, according to the current language information, whether the received document content has a translation abnormality;
a document category identification unit, configured to judge whether the documents at a plurality of different positions in the target page are manageable by counting the document information about the target page in the same language reported by the clients of a plurality of users;
a work order generation unit, configured to generate a document management work order for documents in the target page that belong to the manageable class and have a translation abnormality, so that the translation abnormality can be handled;
and a word stock providing unit, configured to store, in a word stock created for the target page, the content information and auxiliary information of documents that belong to the manageable class and have no translation abnormality in the corresponding language, and/or the auxiliary information of documents that belong to the non-manageable class, wherein the word stock is provided to the client so that the client reports only document information that misses the word stock when collecting documents from the target page.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method of any of the preceding claims.
An electronic device, comprising:
one or more processors; and
a memory associated with the one or more processors, the memory for storing program instructions that, when read for execution by the one or more processors, perform the steps of the method of any of the preceding claims.
According to the specific embodiment provided by the application, the application discloses the following technical effects:
According to the embodiments of the application, the documents of the target page can be collected and reported on the client side, and the server identifies the received document information and judges whether a translation abnormality exists. Because multiple users actually access a given page many times, all functions and operation paths in the page can be covered, so that the full set of documents of the page is obtained and the inspection for translation abnormalities can be completed. In the early stage of executing the scheme, the client may report the full set of collected documents (or may first filter out documents that need no multilingual translation according to a preconfigured rule base). To gradually reduce the amount of data reported by the client, the reported document information may comprise the document content information and auxiliary information that uniquely locates the document. The server can then classify the documents at a plurality of different positions in the target page, by counting the document information about the target page in the same language reported by the clients of a plurality of users, into manageable and non-manageable documents. If a manageable document is correctly translated in the target page, its content information and auxiliary information can be added to the word stock of the target page; the auxiliary information of non-manageable documents can also be added to the word stock.
Such a word stock may be provided to the client, so that, once it has the word stock, the client uploads only document information that misses the word stock when collecting documents. That is, no upload is needed if the document at a position belongs to the manageable class, has been correctly translated in the current language and has not changed, or if the document at a position belongs to the non-manageable class. In short, this approach makes it possible to inspect the target page for translation abnormalities, and, as the word stock created per target page is continuously updated and improved, the amount reported by the client and the workload of the server both decrease gradually.
In the process of collecting and reporting documents on the client side, to avoid affecting the normal interactive rendering of the page, the task can be sliced and the slices executed in the idle time between the page drawing tasks of different frames, which avoids page stutter caused by script execution.
In addition, the consistency and stability of the collected auxiliary information can be ensured by controlling the collection time, by monitoring the path information changes of page elements caused by page interaction, and so on, so that the server obtains more accurate results when it later uses this information to judge the document category.
Of course, a product implementing the application does not necessarily need to achieve all of the above advantages at the same time.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the present application, and a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a system architecture provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of document area type statistics provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of a document area type determination process provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of a system provided by an embodiment of the present application;
FIG. 5 is a flow chart of a first method provided by an embodiment of the present application;
FIG. 6 is a flow chart of a second method provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of a first apparatus provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of a second apparatus provided by an embodiment of the present application;
FIG. 9 is a schematic diagram of an electronic device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art based on these embodiments fall within the scope of protection of the application.
To facilitate understanding of the specific implementation provided by the embodiments of the present application, it should first be noted that the "page" in these embodiments is not a so-called C-terminal (consumer user side) page, but a page from the development perspective: pages produced by the same page framework code may be regarded as the same page. Specifically, such a page is obtained by abstracting and aggregating multiple C-terminal pages that share a common attribute, using rule-based matching such as regular-expression rules. For example, in an e-commerce service system, a developer may provide a "product detail page". From the perspective of a C-terminal user, different products correspond to different product detail pages; from the perspective of the developer, the product detail pages of different products are different instances of the same page.
To detect translation abnormalities in a page (for example, "missed translations" and "mistranslations"), one possible implementation is an inspection mode: a computer simulates user access to the page to obtain the full set of documents of the page in a given language, judges whether any translation abnormality exists, and the relevant development and operation personnel then fix the problem. However, the same page may include many functions, and different documents may be presented under different operation paths. Obtaining the full set of documents in the page therefore generally requires knowing every function of the page and simulating every possible operation path, so that all possible documents can be enumerated. In practice, this approach rarely achieves full coverage of the page documents.
To address these problems, the embodiments of the present application inject document collection logic (for example, a JS script) directly into the page, so that the client executes the script and collects the relevant document information while C-terminal users actually access the page. Specifically, the JS script can collect the current language information of the page and traverse the page elements, collecting the document content of each element that belongs to the document category. The collected page language information and document content can then be reported to a server; the server finds translation abnormalities, generates corresponding work orders, and assigns them to the relevant development, operation and maintenance personnel, who fix the translation problems in the page. That is, although a single access by a single C-terminal user may not cover the full set of documents of a page, different users may use different page functions and operation paths on each access, so multiple accesses by a plurality of C-terminal users to the same page can together cover the full set of documents.
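The traversal performed by the injected script can be sketched as follows. To keep the example runnable outside a browser, it walks a simplified tree (`PageNode`) instead of the real DOM; `PageNode`, `CollectedText` and `collectTexts` are hypothetical names, and the path format is an assumption.

```typescript
// Simplified stand-in for a page element (assumption: text and children are enough here).
interface PageNode {
  tag: string;
  text?: string;         // direct text content of the element, if any
  children?: PageNode[];
}

interface CollectedText {
  path: string;    // locates the document within the page
  content: string; // trimmed text content
}

// Traverses the tree depth-first and collects every non-empty text together with its path.
function collectTexts(
  root: PageNode,
  language: string,
): { language: string; docs: CollectedText[] } {
  const docs: CollectedText[] = [];
  const visit = (node: PageNode, path: string): void => {
    const content = (node.text ?? "").trim();
    if (content.length > 0) {
      docs.push({ path, content });
    }
    (node.children ?? []).forEach((child, i) => {
      visit(child, `${path}/${child.tag}[${i}]`);
    });
  };
  visit(root, root.tag);
  return { language, docs };
}
```

The returned object pairs the page language with the collected documents, which is the shape a client would report to the server in this sketch.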
In a specific implementation, the number of document-type page elements in a page may be very large, and so may the volume of document content. If all documents in the page were reported to the server on every visit, the processing overhead on the end side would be relatively high; moreover, because every end side would report a large amount of document data, the processing overhead on the server would be even higher. Yet the documents that actually have translation anomalies are usually only a small fraction of the total, so full reporting is unnecessary.
In view of this, the inventors found, in the course of implementing the application, that the documents in a page can generally be divided into two types: frame documents, that is, documents configured by development or operations staff, and UGC (User Generated Content) documents. The former are fixed page text, and every user sees the same document in the same language; the latter change with the page data, for example the commodity title in a commodity detail page. In the embodiments of the application, translation problems in page documents are treated by the relevant development and maintenance staff, who mainly handle frame documents. UGC documents are produced by users (mainly merchants and the like) and cannot be treated by the platform's development and maintenance staff, so they do not actually need to be reported. In addition, a relatively stable frame document that has been determined to be correctly translated in the corresponding language does not need to be reported repeatedly either.
Therefore, in the embodiments of the application, at the initial stage of executing the scheme the C end has no knowledge yet of the document categories or of whether the translations are correct, so it may report a larger amount of document data. The server, besides identifying whether the documents reported by the C end have translation anomalies, can also determine the category of each document and generate a "word stock" in the page dimension. That is, each page may have its own word stock, which records which frame documents of the page have been correctly translated in a given language and which documents belong to UGC documents. The word stock can then be provided to the C end. While traversing the page elements, the C end checks each document against the word stock before reporting it, and does not report a document that is either a frame document already correctly translated in the current language or a UGC document. As a result, the documents the C end actually needs to report are only the frame documents that may have translation anomalies. In practice the word stock is refined gradually, and the number of documents the C end has to report decreases as it is refined. Moreover, the C end can gradually come to identify "missed translation" cases itself and mark them when reporting, so that the server no longer needs to identify them.
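The check against the word stock before reporting can be sketched as follows. This is a minimal TypeScript sketch under stated assumptions, not the application's implementation: the names `WordStock`, `WordStockEntry`, `CollectedDoc`, and `shouldReport`, and the choice to key the word stock by element path, are hypothetical.

```typescript
// Hypothetical data shapes; none of these names appear in the source.
type WordStockEntry =
  | { kind: "ugc" } // UGC position: only the path is stored, not the content
  | { kind: "frame"; verified: { [lang: string]: string[] } }; // contents verified per language

// Word stock of one page, keyed by the locating path of the element (an assumption).
type WordStock = Map<string, WordStockEntry>;

interface CollectedDoc {
  path: string;
  content: string;
}

// Returns true when the document still has to be reported to the server.
function shouldReport(doc: CollectedDoc, lang: string, stock?: WordStock): boolean {
  if (!stock) return true; // early stage: no word stock yet, report everything
  const entry = stock.get(doc.path);
  if (!entry) return true; // position not classified yet
  if (entry.kind === "ugc") return false; // UGC documents are not reported
  const verified = entry.verified[lang] || [];
  return verified.indexOf(doc.content) === -1; // skip frame documents known to be correct
}
```

With such a filter, the volume of reported data shrinks as more positions of the page are classified and verified.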
To achieve the above, identifying the document category is a key step. Specifically, the category can be judged from the different behavior that different types of documents exhibit when accessed by different users. For a page frame document, the characteristic is that the document is identical when different users visit: the same document is visible to a large number of clients, with the same or similar content and relative position. For a UGC document, the content seen by different users at the same relative position is usually different. For example, as shown in fig. 2, suppose three users access the same page. The document content at different positions may then fall into three cases: "must present", that is, document content visible to all users in the same language; "occasionally present", that is, document content visible to some of the users in the same language; and "only present", that is, document content visible to only one user in the same language. Document content that is "must present" can generally be judged to be frame content, while "occasionally present" and "only present" document content can be judged to be UGC content, and so on.
Therefore, to allow the document category to be identified, when reporting a document the C end may report, in addition to the document content, auxiliary information of the document. The auxiliary information may be information for locating the specific page element, for example the path of the element in the DOM tree (Document Object Model, which provides a structured description of the document and links the page with scripts and programming languages; each element is regarded as a node), used to determine the position of the element in the page. As long as the element positions are generated uniquely and stably, the data reported by the clients of multiple users can be processed in this way to determine the frame documents and UGC documents in the page. Specifically, the server can compare whether the document content on the element with the same path is the same when different users access the same page, and thereby determine whether the document on that element is a frame document or a UGC document. After a frame document is identified, the server can label its language, synchronize it to the cloud word stock, and match the labeled language against the page language collected as auxiliary information, so that missed-translation documents can be determined. For a missed-translation document, its position in the page can be marked according to the corresponding auxiliary information and saved, for example, as a page screenshot, and a corresponding treatment work order can be generated so that the relevant development and maintenance staff treat the specific missed-translation problem. In addition, the server can identify mistranslation problems through related algorithms.
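One way to derive stable path information for an element is sketched below in TypeScript. The XPath-like format with a 1-based index among same-tag siblings is an assumption, as are the names `ElementLike` and `elementPath`; `ElementLike` is a hypothetical structural stand-in for a DOM element so that the sketch runs outside a browser.

```typescript
// Hypothetical minimal stand-in for a DOM element.
interface ElementLike {
  tagName: string;
  parent: ElementLike | null;
  children: ElementLike[];
}

// Builds a path such as "/html[1]/body[1]/div[2]/span[1]".
// Only tag names and sibling positions are used, so dynamic attributes
// (for example a class added on hover) do not change the result.
function elementPath(el: ElementLike): string {
  const segments: string[] = [];
  let node: ElementLike | null = el;
  while (node) {
    const current: ElementLike = node;
    const parent: ElementLike | null = current.parent;
    let index = 1;
    if (parent) {
      // 1-based position among siblings that share the same tag name
      const sameTag = parent.children.filter((c) => c.tagName === current.tagName);
      index = sameTag.indexOf(current) + 1;
    }
    segments.unshift(current.tagName.toLowerCase() + "[" + index + "]");
    node = parent;
  }
  return "/" + segments.join("/");
}
```

Because the path depends only on document structure, clients of different users rendering the same page frame code would produce the same path for the same position, which is what the server-side comparison relies on.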
To make it easier for the server to mark problematic document content, the client, when reporting document information, may collect not only the path information of the document but also its position coordinates in the page. After the server finds that a document has an anomaly, it can then box-select the document content directly according to the reported position coordinates, for example by adding a rectangular frame.
In a specific implementation, because the document information reported by the client may include both document content and auxiliary information, the two may be combined and expressed as a document unit, with each piece of document information corresponding to one document unit. That is, a document unit stores the specific document content together with its auxiliary information. Because the auxiliary information relates to the path of the element in the DOM, it may also be called the DOM information of the document.
In a specific implementation, to make DOM elements easier to locate, a "DOM fingerprint" may be generated from the DOM information, that is, the DOM information may be converted into a character string by an algorithm such as MD5 (a message-digest algorithm). The DOM fingerprint serves as an ID that uniquely marks a type of DOM element and can be used to locate the position of the document.
In addition, the document content on a specific element may be marked with a "document fingerprint" (likewise, the document text may be converted into a character string by an algorithm such as MD5), so that each document corresponds to one document fingerprint. A document unit can thus be marked by the combination of document fingerprint and DOM fingerprint, and the information it stores is the specific document content, path, position, and so on. The client can report document information in units of document units. When a word stock exists for the page, the client, while traversing the nodes in the DOM, extracts the content, path, position, and other information of the document in each node and compares it with the word stock to determine whether the document is a UGC document or a frame document that has been correctly translated in the current language; if it is neither, it is reported. In other words, what the client actually reports are the frame documents that, according to the word stock, may have translation anomalies. After receiving the document unit reported by a client, the server can compare it with the document units reported by other clients on the basis of the document fingerprint and DOM fingerprint, and determine whether it belongs to a frame document or a UGC document. Because the clients of different users use the same fingerprint generation algorithm, the same document text and the same DOM information always yield the same document fingerprint and DOM fingerprint.
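A document unit marked by document fingerprint plus DOM fingerprint can be sketched as follows. The source mentions MD5 as one possible algorithm; to keep this sketch dependency-free, a 32-bit FNV-1a hash is swapped in, which is not MD5 and is far weaker against collisions. The names `fingerprint`, `DocUnit`, `makeDocUnit`, and `unitKey` are hypothetical.

```typescript
// FNV-1a (32-bit), used here only as a dependency-free stand-in for MD5.
function fingerprint(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  // Fixed-width hexadecimal string of 8 characters
  return ("0000000" + hash.toString(16)).slice(-8);
}

// Hypothetical shape of a document unit.
interface DocUnit {
  docFingerprint: string; // derived from the document text
  domFingerprint: string; // derived from the DOM path
  content: string;
  path: string;
  rect: { x: number; y: number; width: number; height: number };
}

function makeDocUnit(content: string, path: string, rect: DocUnit["rect"]): DocUnit {
  return {
    docFingerprint: fingerprint(content),
    domFingerprint: fingerprint(path),
    content: content,
    path: path,
    rect: rect,
  };
}

// Identity of a document unit: document fingerprint + DOM fingerprint.
function unitKey(unit: DocUnit): string {
  return unit.docFingerprint + "+" + unit.domFingerprint;
}
```

Since the hash is deterministic, two clients that see the same text at the same path produce the same key, while different text at the same path keeps the DOM fingerprint and changes the document fingerprint.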
If, in the document units reported by the clients of two users, the same DOM fingerprint corresponds to different document fingerprints, this indicates that the two users saw different document text at the position corresponding to that DOM fingerprint, and the document may be a UGC document.
That is, when identifying the document category, as shown in fig. 3, after receiving the document unit data reported by multiple clients, the server can aggregate the data reported by different clients over a time period according to the document fingerprint, DOM fingerprint, page fingerprint, and current language information in each document unit, and determine whether the document content at the same region of the page is the same when different users access it. The region types may include the "must present", "occasionally present", and "only present" types described above. The category of a document appearing in a region is then determined from the region type: a document in a "must present" region is a frame document, while a document appearing in an "occasionally present" or "only present" region can be determined to be a UGC document, and so on.
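The aggregation step can be sketched as follows, assuming the reports passed in have already been restricted to one page, one language, and one time period. The names `UnitReport` and `classifyReports` are hypothetical, and classifying per pair of DOM fingerprint and document fingerprint is one possible reading of the region types, not a confirmed detail of the application.

```typescript
type AreaType = "must-present" | "occasionally-present" | "only-present";
type Category = "frame" | "ugc";

// Hypothetical shape of one reported document unit, reduced to what is compared.
interface UnitReport {
  userId: string;
  domFingerprint: string;
  docFingerprint: string;
}

// Counts, for every (position, content) pair, how many distinct users saw it.
function classifyReports(reports: UnitReport[]): Map<string, { area: AreaType; category: Category }> {
  const allUsers = new Set<string>();
  const seenBy = new Map<string, Set<string>>();
  for (const r of reports) {
    allUsers.add(r.userId);
    const key = r.domFingerprint + "|" + r.docFingerprint;
    let users = seenBy.get(key);
    if (!users) {
      users = new Set<string>();
      seenBy.set(key, users);
    }
    users.add(r.userId);
  }
  const result = new Map<string, { area: AreaType; category: Category }>();
  seenBy.forEach((users, key) => {
    let area: AreaType;
    if (users.size === allUsers.size) area = "must-present"; // seen by every user
    else if (users.size === 1) area = "only-present"; // seen by exactly one user
    else area = "occasionally-present"; // seen by some users
    result.set(key, { area: area, category: area === "must-present" ? "frame" : "ugc" });
  });
  return result;
}
```

Note that with reports from a single user every pair looks "must present", so a classification of this kind is only meaningful once enough users have reported.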
It should be noted that a page may contain some documents that do not require multilingual translation, for example the text logo of a page or currency symbols (e.g., $). Therefore, besides providing the word stock so that the client can decide whether a document unit needs to be reported, a rule base may also be provided to the client for configuring such special cases. Whether a document requires multilingual translation depends on where it appears: the same document may have different meanings at different positions in a page, and so may or may not need translation. The rule base can therefore store the page, the document, and the UI area to which the document belongs, which together uniquely identify a specific piece of document, along with the rule configured for it, for example whether it needs to be reported. The rule base can be delivered to the client at an early stage of executing the scheme, so that even when no word stock is available yet the client can first filter out, based on the rule base, the document units that do not need to be reported.
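A rule base keyed by page, document, and UI area might look like the following sketch. The names `Rule`, `RuleBase`, and `needsReport`, the field names, and the default of reporting when no rule matches are all assumptions made for illustration.

```typescript
// Hypothetical rule record: page + document text + UI area identify one piece of document.
interface Rule {
  pageId: string;
  text: string;
  uiArea: string;
  report: boolean; // whether the document needs to be reported
}

class RuleBase {
  private rules = new Map<string, Rule>();

  // JSON encoding of the triple avoids ambiguity between field boundaries.
  private static key(pageId: string, text: string, uiArea: string): string {
    return JSON.stringify([pageId, text, uiArea]);
  }

  add(rule: Rule): void {
    this.rules.set(RuleBase.key(rule.pageId, rule.text, rule.uiArea), rule);
  }

  // Assumption: a document without a matching rule is reported.
  needsReport(pageId: string, text: string, uiArea: string): boolean {
    const rule = this.rules.get(RuleBase.key(pageId, text, uiArea));
    return rule ? rule.report : true;
  }
}
```

Keying on the UI area as well as the text reflects the point above that the same text may need translation in one position and not in another.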
The above describes the overall implementation framework of the scheme provided by the embodiments of the application. As shown in fig. 1, it may include document collection and reporting on the client side, and document category identification, translation anomaly identification (including language identification, to detect problems such as missed translation and mistranslation), and word stock generation and updating on the server side. The generated word stock can be provided to the client, which uses it during document collection to decide whether to report a document, so that the number of documents reported by the client gradually decreases.
Some details of document collection and reporting on the client side are described below. As described above, the page documents can be collected by a script such as JS injected into the page. The implementation logic includes traversing each node in the DOM tree, identifying the text nodes, collecting the auxiliary information of each document, reporting the documents, and so on. Once a word stock and a rule base have been generated for the page, the script must also query them to decide whether a given document unit needs to be reported. The script therefore has a considerable amount of work to do. Script execution blocks UI rendering and interaction, that is, the browser pauses rendering and interaction while the script runs; if the pause lasts too long, the page may stutter and the user experience suffers. It is therefore necessary to consider how to complete all of these tasks through script execution without affecting the normal rendering and interaction of the page.
To address this, the efficiency of node traversal can be dealt with first. Traversal can be implemented with hand-written traversal code or with a browser-native API (Application Programming Interface). For example, the natively supported NodeIterator and TreeWalker APIs (created with document.createNodeIterator() and document.createTreeWalker()) support fast traversal.
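The hand-written variant of the traversal can be sketched as below. `NodeLike` is a hypothetical minimal stand-in for a DOM node so that the sketch runs outside a browser; in a browser, `document.createTreeWalker(root, NodeFilter.SHOW_TEXT)` offers the native equivalent. Skipping whitespace-only text nodes is an assumption.

```typescript
// Hypothetical minimal stand-in for a DOM node (nodeType 3 is a text node in the DOM).
interface NodeLike {
  nodeType: number;
  textContent: string | null;
  childNodes: ArrayLike<NodeLike>;
}

const TEXT_NODE = 3;

// Iterative depth-first traversal that returns non-empty text nodes in document order.
function collectTextNodes(root: NodeLike): NodeLike[] {
  const found: NodeLike[] = [];
  const stack: NodeLike[] = [root];
  while (stack.length > 0) {
    const node = stack.pop() as NodeLike;
    if (node.nodeType === TEXT_NODE) {
      // Assumption: whitespace-only text carries no document content
      if ((node.textContent || "").trim() !== "") found.push(node);
      continue;
    }
    // Push children in reverse so that they are popped in document order
    for (let i = node.childNodes.length - 1; i >= 0; i--) {
      stack.push(node.childNodes[i]);
    }
  }
  return found;
}
```

An explicit stack avoids deep recursion on large pages and makes it straightforward to pause and resume the traversal between frames.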
Besides the traversal method, the time at which traversal starts also needs to be considered, that is, when to begin traversing so that page rendering and interaction are not affected. During rendering and interaction (for example, page scrolling), the browser generally refreshes the page at a certain frame rate. For the refresh to be imperceptible to the naked eye, the refresh rate is generally 30 to 60 Hz, that is, 30 to 60 refreshes per second. At 60 Hz, the interval between refreshes is 1 s / 60, or approximately 16.7 ms.
As described above, JS execution blocks UI rendering and interaction. To avoid affecting page rendering and interaction, the total JS execution time within one browser frame must not exceed the duration that stays imperceptible to the naked eye. However, even the fastest traversal method typically needs about 80 ms to complete one traversal, far more than 16.7 ms, and would therefore noticeably affect page rendering and interaction.
In a preferred embodiment of the application, this is addressed by slicing the JS execution logic and distributing the resulting subtasks over multiple frames. DOM traversal can be sliced into logic units, where each unit is the processing logic for one element and takes less than 1 ms to execute; spreading these units across frames minimizes the negative impact on browser rendering and interaction (that is, stuttering). Specifically, the browser draws a bitmap for each frame, that is, at fixed intervals, and some idle time is usually left over after the drawing task of each frame. In the embodiments of the application, the processing logic can therefore be split into many small task units, each of which is inserted into the idle time of a frame. Because each small unit needs very little time, less than the idle time, the rendering and interaction of the page are not affected. Each small task unit can obtain the document and auxiliary information on the traversed node, check them against the word stock and the rule base, and decide whether to report. In addition, in a specific implementation, the browser-native requestAnimationFrame API and the like can be used to sense how much of each frame's executable time has been consumed and the browser's current animation refresh state, so as to control the execution time of each slice and decide whether a logic unit should be deferred to the next frame, keeping the collection imperceptible to the naked eye.
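The slicing idea can be sketched as a small scheduler that runs task units until a per-frame time budget is used up and then defers the rest to the next frame. The names `Task`, `FrameScheduler`, and `runSliced` and the budget parameter are hypothetical. The clock and the frame callback are injected so that the sketch runs anywhere; in a browser one might pass `() => performance.now()` and `(cb) => requestAnimationFrame(cb)`.

```typescript
type Task = () => void;

// Hypothetical abstraction over the browser's clock and frame callback.
interface FrameScheduler {
  now(): number; // current time in milliseconds
  nextFrame(callback: () => void): void; // run the callback in the next frame
}

// Runs the tasks in order, spending at most about budgetMs per frame.
function runSliced(tasks: Task[], budgetMs: number, scheduler: FrameScheduler, done?: () => void): void {
  let next = 0;
  function frame(): void {
    const start = scheduler.now();
    // At least one task runs per frame so that progress is guaranteed.
    do {
      tasks[next]();
      next++;
    } while (next < tasks.length && scheduler.now() - start < budgetMs);
    if (next < tasks.length) {
      scheduler.nextFrame(frame); // defer the remaining units to the next frame
    } else if (done) {
      done();
    }
  }
  if (tasks.length === 0) {
    if (done) done();
    return;
  }
  scheduler.nextFrame(frame);
}
```

With ten task units of 1 ms each and a 4 ms budget, the work spreads over three frames (4, 4, and 2 units). The budget value itself would have to be tuned against the idle time actually left in each frame.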
Besides slicing the JS execution logic, the time at which JS execution is started also needs to be considered. The DOM tree of a page may grow, shrink, or change as the user interacts with it, which means that new documents may appear with interaction. From the user's point of view, document collection therefore has to cover two occasions: collection without interaction, after the user enters the page, and collection after interaction. In a specific implementation, the JS collection script can be started after the user enters the page and begin traversing from the root node of the page (in the manner described above), which takes care of collecting all documents that appear without interaction.
Documents that appear after interaction can be collected by monitoring changes to the page DOM. This monitoring can be implemented with the browser's MutationObserver API, which observes additions and removals of child nodes, changes to element attributes, changes to text content, and so on, and invokes a callback. Whether user interaction has occurred can thus be determined from changes to the DOM; if it has, the document collection logic is triggered, and the document information on the changed nodes in the DOM is collected.
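The step of turning observed DOM changes into a list of nodes to re-collect can be sketched as follows. `MutationLike` is a hypothetical stand-in that mirrors the `type`, `target`, and `addedNodes` fields of the browser's `MutationRecord`, so the function could serve as the body of a `MutationObserver` callback; the policy of re-collecting added nodes for child-list changes and the target node otherwise is an assumption.

```typescript
// Hypothetical stand-in mirroring the MutationRecord fields used here.
interface MutationLike {
  type: "childList" | "characterData" | "attributes";
  target: object;
  addedNodes: ArrayLike<object>;
}

// Returns each changed node once, in the order in which the changes were observed.
function nodesToRecollect(records: MutationLike[]): object[] {
  const seen = new Set<object>();
  const result: object[] = [];
  const add = (node: object): void => {
    if (!seen.has(node)) {
      seen.add(node);
      result.push(node);
    }
  };
  for (const record of records) {
    if (record.type === "childList") {
      // New subtrees may contain new documents
      for (let i = 0; i < record.addedNodes.length; i++) add(record.addedNodes[i]);
    } else {
      // Text or attribute change: re-collect the node itself
      add(record.target);
    }
  }
  return result;
}
```

In a browser, an observer created with `new MutationObserver(callback)` and started with `observe(root, { childList: true, subtree: true, characterData: true, attributes: true })` would deliver such records in batches.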
In addition, in practice the page may jitter while it is being displayed, shifting the position coordinates of an element within the page. Because the embodiments of the application may need to collect the position coordinates of an element so that the server can box-select abnormal document content in the page, the jitter state of the page can be checked when the collection logic is triggered, and collection performed only after the jitter has ended; only then can the correct position coordinates of the document content be obtained. Whether the page has finished jittering can be determined through a browser API. For example, the PerformanceObserver API can be used to monitor page performance, and its layout-shift entries can be used to determine when the jitter of the page elements ends.
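One simple way to decide that jitter has ended is to wait for a quiet period with no further layout shifts. The sketch below assumes that policy; the class name `ShiftSettleDetector` and the quiet-window parameter are hypothetical. In a browser, the timestamps passed to `recordShift` could come from the `startTime` of `layout-shift` entries delivered to a `PerformanceObserver`.

```typescript
// Tracks the time of the latest layout shift and reports when the page has been quiet long enough.
class ShiftSettleDetector {
  private quietMs: number;
  private lastShiftAt: number;

  // quietMs: assumed length of the quiet window; startedAt: time at which observation began.
  constructor(quietMs: number, startedAt: number) {
    this.quietMs = quietMs;
    this.lastShiftAt = startedAt;
  }

  // Call for every observed layout shift with its timestamp in milliseconds.
  recordShift(at: number): void {
    if (at > this.lastShiftAt) this.lastShiftAt = at;
  }

  // True when no shift has been seen for at least quietMs.
  isSettled(now: number): boolean {
    return now - this.lastShiftAt >= this.quietMs;
  }
}
```

Position coordinates would then be collected only once `isSettled` returns true, so that the reported rectangle matches what the user finally sees.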
Furthermore, the path information of an element may change when the user interacts with the page. For example, when the mouse hovers over an element, a new class attribute may be generated and the XPath location of the element may change. Because the embodiments of the application locate a document by the path information of its element, the effect of attribute changes caused by interactions such as hovering and clicking must be eliminated so that the same element does not produce different paths. In a specific implementation, the MutationObserver API can be used to monitor the attributes that affect the path of an element and filter out dynamic attribute changes. That is, the path information of an element is the path information collected the first time; during subsequent interaction it is not collected again even if it changes, which keeps the path generation logic for the element consistent and unique.
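Keeping the first collected path for each element can be sketched with a cache keyed by the element object. The class name `StablePathRegistry` and the injected `compute` function are hypothetical; a `WeakMap` is used as a design choice so that cached paths do not keep removed elements alive.

```typescript
// Remembers the first path computed for each element and returns it on later requests.
class StablePathRegistry<E extends object> {
  private firstPaths = new WeakMap<E, string>();

  // compute is only called the first time an element is seen.
  pathFor(element: E, compute: (element: E) => string): string {
    const known = this.firstPaths.get(element);
    if (known !== undefined) return known;
    const path = compute(element);
    this.firstPaths.set(element, path);
    return path;
  }
}
```

Even if a hover later changes an attribute that the path computation depends on, the registry keeps returning the path recorded on first collection.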
In addition, a page may contain hidden elements. For example, when a page implements drop-downs, menus, and similar logic, some elements exist in the DOM but are not actually shown to the user. The documents of such elements can be treated as invalid and excluded from collection. A simple way to detect a hidden element is to check its visibility-related style attributes. However, front ends hide elements in many ways, for example by placing a covering element on top, so the IntersectionObserver API can be used to calculate, from the viewport's point of view, the proportion of the element that is displayed, and hidden elements can be identified on that basis (for example, an element can be regarded as hidden when the proportion of it exposed within the visible range of the window is below a certain threshold).
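The threshold test on the displayed proportion can be sketched geometrically. The names `Rect`, `visibleRatio`, and `isHidden` are hypothetical, and the sketch only measures overlap with the viewport rectangle; it does not detect occlusion by other elements. In a browser, the `intersectionRatio` of an `IntersectionObserver` entry provides a comparable value.

```typescript
interface Rect {
  left: number;
  top: number;
  width: number;
  height: number;
}

// Fraction of the element's area that lies inside the viewport (0 to 1).
function visibleRatio(element: Rect, viewport: Rect): number {
  const overlapWidth = Math.max(
    0,
    Math.min(element.left + element.width, viewport.left + viewport.width) - Math.max(element.left, viewport.left)
  );
  const overlapHeight = Math.max(
    0,
    Math.min(element.top + element.height, viewport.top + viewport.height) - Math.max(element.top, viewport.top)
  );
  const area = element.width * element.height;
  return area > 0 ? (overlapWidth * overlapHeight) / area : 0; // zero-size elements count as not visible
}

// An element is treated as hidden when its visible ratio is below the threshold (value is an assumption).
function isHidden(ratio: number, threshold: number): boolean {
  return ratio < threshold;
}
```

For example, an element of 100 by 100 whose left half lies outside the viewport has a ratio of 0.5 and would be collected under a threshold of 0.1.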
Based on the foregoing description, the following describes the solution provided by the embodiments of the present application from a plurality of different angles, respectively.
Embodiment 1
As described above, in a specific implementation the scheme involves the page script, which handles logic such as document collection and reporting on the user terminal side, and the server, which handles logic such as language identification, document category identification, and word stock generation and updating. The first embodiment therefore provides a page document information processing system from the perspective of the system formed by the page script and the server. Referring to fig. 4, the system may specifically include:
the client 401 is configured to collect, by executing a page script injected into a target page, the current language information of the target page and the document information in the page while a user accesses the target page, and to report the collected document information to the server 402, where the collected document information includes document content information and auxiliary information of the document, and the auxiliary information includes information for uniquely locating the document in the target page;
the server 402 is configured to determine, according to the current language information, whether the received document content has a translation anomaly; to determine whether the documents at a plurality of different positions in the target page are manageable documents by aggregating the document information about the target page reported in the same language by the clients of a plurality of users; and to store, in a word stock created for the target page, the document content information and auxiliary information of manageable documents that have no translation anomaly in the corresponding language, and/or the auxiliary information of non-manageable documents;
the word stock is synchronized to the client, so that when documents of the target page are collected through the page script, only the document information that does not hit the word stock is reported.
The first embodiment describes the scheme provided by the embodiments of the application mainly from an overall perspective. It should be noted that the target page may be any page that needs to be displayed in multiple languages. In addition, as described above, from the development perspective a page may be an aggregation of C-end pages. Take commodity detail pages: from the perspective of the C-end user, each detail page belongs to a specific commodity, and different commodities have different detail pages; from the development perspective, however, the detail pages of different commodities are generated by the same page frame code and therefore count as the same page. Document collection, reporting, and similar functions can thus be implemented by injecting the page script into that shared page frame code.
In addition, in the embodiments of the application, document information is collected and reported on the client side while the user is actually accessing the target page. However, collection and reporting need not be performed on every visit of every user; they may instead be performed at a certain sampling rate, and the sampling rate may differ between target pages.
Specifically, because document collection on the client side is mainly used to discover translation anomalies that may exist in a page, a large amount of data has to be collected from the clients of different users before the collected data becomes stable and reliable. For page document data to become stable and reliable, two key factors must be considered: the iterative updating of the page, and the data volume. Because pages differ in traffic and operational complexity, no fixed sampling rate suits all of them, so the sampling rate can instead be calculated flexibly.
For example, in an e-commerce service system, typical page types include venue activity pages, middle- and back-office pages, article-type pages, and so on. Venue activity pages iterate their functions relatively quickly, whereas article-type pages have a relatively stable page structure and relatively slow code updates. Pages with different iteration speeds can be sampled differently: for a page with a stable structure and slow code updates, the sampling rate may be lowered gradually, while for a page that iterates quickly the sampling rate may be kept higher, so that translation anomalies in the page are discovered more promptly.
In addition, even for the same page, different collection strategies (mainly the sampling rate and the like) may be used at different times. To keep the data valid and avoid data drift caused by code changes, the collection strategy can be adjusted over a complete life cycle. For example, the life cycle of a page may include several phases, such as a traffic observation period, a stable iteration period, a change observation period, and a change stabilization period. In one specific implementation, the corresponding collection strategies may be as follows:
traffic observation period: a fixed collection scheme is adopted to accommodate pages with different traffic and different operational complexity;
stable iteration period: as the name implies, the page document data tend to be stable, so a relatively stable collection strategy can be adopted;
change observation period: an application release, code change, or the like may invalidate some documents or change their positions; at this stage the collection ratio should be adjusted dynamically, and the adjustment can be combined with manually configured interaction-link strategies and the like;
change stabilization period: after the change the page stabilizes again; this period can be observed for some time, obsolete historical data can be cleaned up, and the page can then gradually return to the stable collection strategy.
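A sampling decision that depends on the life cycle phase can be sketched as follows. The phase names follow the list above, but the rate values in `ASSUMED_RATES` are purely illustrative (the source gives no numbers), and the names `Phase`, `PhaseRates`, and `shouldCollect` are hypothetical. The random source is injected; in a browser, `Math.random` could be passed.

```typescript
type Phase = "traffic-observation" | "stable-iteration" | "change-observation" | "change-stabilization";

type PhaseRates = { [P in Phase]: number };

// Illustrative values only; the source does not specify concrete sampling rates.
const ASSUMED_RATES: PhaseRates = {
  "traffic-observation": 0.1,
  "stable-iteration": 0.01,
  "change-observation": 0.2,
  "change-stabilization": 0.05,
};

// Decides for one page visit whether the collection script should run.
// random must return a value in [0, 1).
function shouldCollect(phase: Phase, random: () => number, rates: PhaseRates = ASSUMED_RATES): boolean {
  return random() < rates[phase];
}
```

Because the rates are plain data, they could be recalculated per page and per phase on the server and delivered together with the script configuration.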
It should be noted that the page script SDK (Software Development Kit) can be injected dynamically through NGINX (a high-performance HTTP (HyperText Transfer Protocol) and reverse-proxy web server), which minimizes the impact on the page. In this case, life cycle management can also be hosted on the NGINX server for unified control.
In addition, in the embodiments of the application, the document data collected from the client side may be called the "observation set": from the data in the observation set it is found which documents in a page are frame documents (other names are possible; they are collectively called manageable documents in the embodiments of the application) and which are UGC documents that do not need to be treated (non-manageable documents in the embodiments of the application). Once the observed document data have stabilized, a set of information can be distilled for each page, consisting of correctly translated frame documents with their languages and/or the paths of UGC documents (the content of a UGC document need not be saved). As this data stabilizes, it forms the word stock (also called a term base; because it contains the documents of the page that are correctly translated, it may also be called the "correct set"). Different pages may have different word stocks. The word stocks can be split by page and synchronized to the client, so that while collecting documents the script skips reporting any document that hits the word stock, which effectively reduces the volume of reported data.
In addition, once a page-dimension word stock has been formed, the identification of document problems can be moved from offline identification on the server to the client: the client performs end-side computation based on the word stock to judge missed-translation and mistranslation problems, so that the identification result can be attached when the document is reported. For example, the missed-translation problem can be identified fairly easily on the client using the word stock; the problematic document can then be marked accordingly and uploaded to the server. The mistranslation problem may be harder to identify; such documents may be reported to the server directly and identified by the server's algorithms. This greatly reduces the offline data load on the server and further improves the general applicability of the scheme.
In addition, as described above, a specific page may contain documents that do not require multilingual translation, such as a "logo" or a currency symbol, so a "rule set" may exist in addition to the aforementioned "observation set" and "correct set". The rule set may be configured by personnel such as the developers and operators of the page. A specific document may be uniquely identified and configured in the form of "page + document + UI (User Interface) area to which the document belongs".
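The "page + document + UI area" key described above can be illustrated with a minimal sketch. The class name `RuleKey`, the function `is_exempt`, and the sample page identifiers and UI area names are all hypothetical; the specification does not prescribe any particular data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleKey:
    """Uniquely identifies one exempt document (all field formats are assumed)."""
    page: str      # page identifier
    text: str      # document content that needs no multilingual translation
    ui_area: str   # UI area to which the document belongs


# Hypothetical rule set as it might be configured by page developers or operators.
RULE_SET = {
    RuleKey("/checkout", "¥", "price-panel"),
    RuleKey("/home", "ACME", "header-logo"),
}


def is_exempt(page: str, text: str, ui_area: str, rules=RULE_SET) -> bool:
    """Return True when the document is configured as not requiring translation."""
    return RuleKey(page, text, ui_area) in rules
```

Because the UI area is part of the key, the same string is only exempt where it was configured, for example `is_exempt("/checkout", "¥", "price-panel")` holds while the same symbol in another area is still collected.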
In summary, with this method the document information of the target page is collected and reported on the client side, and the server identifies the received document information and judges whether a translation abnormality exists. As multiple users actually access a specific page many times, all functions and all operation paths in the page can be covered, so information about the full set of documents of the page can be obtained and the inspection of translation abnormalities in the page can be completed. In the early stage of scheme execution, the client side may report all collected documents (or may first filter out documents that do not need multilingual translation according to a preconfigured rule base). To gradually reduce the amount of data reported by the client side, the reported document information may include document content information and auxiliary information for uniquely positioning the document. In this way, by counting the document information about the target page in the same language reported by the client sides of multiple users, the server side can also classify the documents at multiple different positions in the target page and determine which are manageable documents and which are non-manageable documents. If a manageable document is correctly translated in a specific target page, its document content information and the corresponding auxiliary information can be added to the word stock of the target page; the auxiliary information of non-manageable documents can also be added to the word stock.
Such a word stock may be provided to the client side, so that once the client side has the word stock, it only needs to upload document information that does not hit the word stock when collecting documents. That is, if the document at a certain position belongs to the manageable class, has been correctly translated in the current language, and has not changed, or if the document at a certain position belongs to the non-manageable class, it does not need to be uploaded. In summary, this method makes it possible to inspect the translation abnormalities in the target page, and, as the word stock created in the dimension of the target page is continuously updated and improved, both the amount of reporting on the client side and the workload on the server side gradually decrease.
Embodiment Two
The above embodiment describes the solution provided by the embodiment of the present application mainly from the perspective of the overall system. The second embodiment provides a page document information processing method mainly from the perspective of the page script on the client side. Referring to fig. 5, the method may specifically include:
S501: determining word stock information associated with a target page, wherein the word stock is used for storing information of a plurality of entries, and the information of the plurality of entries comprises: the content information and auxiliary information of documents in the target page that belong to the manageable class and have no translation abnormality in a target language, and/or the auxiliary information of documents that belong to the non-manageable class, wherein the auxiliary information comprises information for uniquely positioning the document in the target page.
In the second embodiment, the word stock information associated with a specific target page may be preconfigured, or may be created and gradually updated by the server according to the collected document information, as described in embodiment one. Of course, the two modes can be combined: the developers or operators of the page can preconfigure a basic word stock for a specific page, and the server subsequently updates and improves it step by step as it collects the document information reported by the client side.
When the word stock is gradually generated and updated by the server, in the early stage of scheme execution the word stock may not yet have been created for a specific page, or the word stock associated with the page may be empty. In that case, the document information collected by the client side may be reported to the server in full. The server side can then judge whether the documents at multiple different positions in the target page are manageable by counting the document information about the target page in the same language reported by multiple client sides. For example, in a specific implementation, the server may determine, from that document information, whether the document contents viewed by different users at the same position of the target page are the same, and decide from the result whether the document at that position is manageable (for example, whether it is a frame document or a UGC document). Then, the content information and auxiliary information of the manageable documents that have no translation abnormality in the corresponding language, and/or the auxiliary information of the non-manageable documents, can be stored in the word stock created for the target page.
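The server-side classification and word stock construction described above might look roughly like the following sketch. It assumes each report is a `(user_id, path, content)` tuple for one page in one language, that a position needs observations from at least `min_users` users before it is classified, and that translation abnormality detection is supplied as a callable `has_anomaly`; all of these names and thresholds are illustrative rather than taken from the specification.

```python
from collections import defaultdict


def classify_positions(reports, min_users=2):
    """Classify each path as manageable or non-manageable.

    A position whose content is identical for every observing user is treated
    as a frame (manageable) document; otherwise it is treated as UGC.
    """
    seen = defaultdict(dict)  # path -> {user_id: content}
    for user_id, path, content in reports:
        seen[path][user_id] = content

    categories = {}
    for path, by_user in seen.items():
        if len(by_user) < min_users:
            continue  # too few observations to decide yet
        distinct = set(by_user.values())
        categories[path] = "manageable" if len(distinct) == 1 else "non_manageable"
    return categories


def build_word_stock(reports, has_anomaly, min_users=2):
    """Build the per-page word stock from classified positions."""
    reports = list(reports)
    categories = classify_positions(reports, min_users)
    content_at = {path: content for _, path, content in reports}

    word_stock = {}
    for path, category in categories.items():
        if category == "non_manageable":
            # Only the auxiliary information (path) is kept for UGC documents.
            word_stock[path] = {"category": category}
        elif not has_anomaly(content_at[path]):
            word_stock[path] = {"category": category, "content": content_at[path]}
    return word_stock
```

Manageable documents that still have a translation abnormality are deliberately left out of the word stock, so the client keeps reporting them until they are fixed.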
S502: in the process of a user accessing the target page, collecting current language information of the target page and document information in the page, wherein the collected document information includes document content information and document auxiliary information.
Once the word stock has been obtained, the current language information of the target page and the document information in the page can be collected while the user accesses the target page. Specifically, they can first be collected when the client side has just started to access the target page and no interaction has yet occurred. In addition, changes to the Document Object Model (DOM) associated with the target page can be monitored to perceive the user's interaction behavior, so that when such behavior is perceived, the document information on the changed nodes can be collected.
In addition, if page interaction occurs while the client side is executing the document information collection and reporting task, and the interaction triggers the browser to refresh the page at a preset frame rate, the task can be split into a plurality of subtasks. The subtasks are then inserted, one subtask at a time and in a scattered manner, into the idle time between the page drawing tasks of adjacent frames and executed there.
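The task slicing idea can be illustrated with a small simulation. This is only a sketch: it models frames and idle time with plain numbers instead of a real browser, and the names `slice_task`, `run_in_idle_time`, `idle_ms_per_frame`, and `cost_ms` are hypothetical. In an actual page script, a browser facility such as `requestIdleCallback` could play the role of the idle-time scheduler, which is an assumption since the specification names no API.

```python
def slice_task(items, chunk_size):
    """Split one large collection task into subtasks of at most chunk_size items."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def run_in_idle_time(subtasks, idle_ms_per_frame, cost_ms, handle):
    """Run subtasks only in the simulated idle time of successive frames.

    Returns the number of frames needed. At least one subtask runs per frame
    so that progress is guaranteed even when a subtask exceeds the budget.
    """
    queue = list(subtasks)
    frames = 0
    index = 0
    while index < len(queue):
        frames += 1
        budget = idle_ms_per_frame
        ran_one = False
        while index < len(queue):
            cost = cost_ms(queue[index])
            if ran_one and cost > budget:
                break  # give control back to rendering until the next frame
            handle(queue[index])
            budget -= cost
            ran_one = True
            index += 1
    return frames
```

Smaller subtasks fit the idle gaps better at the price of more scheduling overhead, so the chunk size is a tuning knob rather than a fixed value.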
The document auxiliary information to be collected may specifically include path information (path) of the page element corresponding to the document in the DOM. In this case, changes to the path information of page elements caused by page interaction behavior can be monitored, so that the path information collected for the same page element stays unique and the collection result is not disturbed by interaction.
In addition, the collected auxiliary information of the document can further include position coordinate information of the document in the target page, so that after identifying document information with a translation abnormality, the server can add a corresponding frame selection mark in the target page according to the position coordinate information. In a preferred mode, before the current language information of the target page and the document information in the page are collected, performance monitoring can be performed on the target page to determine the time at which element jitter in the target page ends, so that collection of the current language information and the document information is triggered after the element jitter has ended.
S503: comparing the collected document information with the entry information in the word stock, and reporting the document information that does not hit the word stock, together with the current language information, to the server side, so that the server side can judge whether the received document information has a translation abnormality according to the current language information.
After the set of documents on specific nodes has been collected, the collected document information can be compared with the entry information in the word stock, and the document information that does not hit the word stock can be reported to the server together with the current language information. That is, if a collected document is a frame document that has been correctly translated in the current language and has not changed, it does not need to be reported; likewise, if the collected document is a UGC document, it does not need to be reported. Specifically, the document category can be judged from the entries recorded in the word stock and their auxiliary information, such as the corresponding paths. For example, after a document is collected, its path information can be obtained and used to check whether the word stock contains an entry corresponding to that path. If it does, the category corresponding to the path can be determined. If the category is non-manageable, reporting of the collected document can be canceled directly. If the category is manageable, it can be judged whether the content of the currently collected document is consistent with the document content recorded for that path in the word stock (if the word stock stores document fingerprints, the collected content can first be converted into a fingerprint and then compared). If the two are consistent, the collected document has been correctly translated and has not changed, and its reporting can be canceled.
Otherwise, if no entry corresponding to the path of the currently collected document exists in the word stock, the document may be a newly added document that the server has not yet identified, or a document that the server has identified as having a translation abnormality but that the relevant developers or operators have not yet managed. In either case, it is a document to be reported.
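The comparison logic of S503 can be sketched as follows. The word stock layout (a dictionary keyed by path, with a `category` and, for manageable entries, a `fingerprint`) and the choice of SHA-256 as the fingerprint function are assumptions made for illustration; the specification only says that a document fingerprint may be stored.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hypothetical document fingerprint: SHA-256 of the trimmed content."""
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()


def should_report(path: str, content: str, word_stock: dict) -> bool:
    """Decide whether a collected document must be reported to the server."""
    entry = word_stock.get(path)
    if entry is None:
        # No entry for this path: new document, or abnormality not yet managed.
        return True
    if entry["category"] == "non_manageable":
        # UGC document: never reported.
        return False
    # Manageable document: report only when the content has changed.
    return entry["fingerprint"] != fingerprint(content)
```

Storing fingerprints instead of raw content keeps the word stock that is synchronized to the client compact, at the cost of not being able to show the expected text on the client.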
For a document to be reported, if the document category corresponding to the target entry hit in the word stock is the manageable class and the content of the currently collected document is inconsistent with the document content in the target entry, language identification can further be performed on the collected document on the client side. If the language of the collected document content is inconsistent with the current language of the target page, a translation abnormality identifier can be added to the document content before it is reported to the server.
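As a rough illustration of client-side language identification, the sketch below uses a deliberately coarse heuristic: it compares the dominant Unicode script of the collected content with the script expected for the page language. This is a stand-in technique, not the identification method of the specification, and it cannot distinguish languages that share a script (for example English and French). The mapping `EXPECTED_SCRIPT` and the report field names are hypothetical.

```python
import unicodedata
from collections import Counter

# Hypothetical mapping from page language to the expected writing script.
EXPECTED_SCRIPT = {"en": "LATIN", "fr": "LATIN", "ru": "CYRILLIC", "zh": "CJK"}


def dominant_script(text: str):
    """Return the most frequent script among the letters of text, or None."""
    counts = Counter(
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    )
    return counts.most_common(1)[0][0] if counts else None


def build_report(path: str, content: str, page_language: str) -> dict:
    """Build the report payload, flagging a likely missed translation."""
    report = {"path": path, "content": content, "language": page_language}
    script = dominant_script(content)
    expected = EXPECTED_SCRIPT.get(page_language)
    if script and expected and script != expected:
        report["translation_anomaly"] = True
    return report
```

Documents that pass this check are still reported without the flag, so the server-side algorithm can examine harder cases such as mistranslation.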
In addition, it can be judged whether the collected document information belongs to a hidden element; if so, reporting of that document information is canceled.
Embodiment Three
The third embodiment provides a page document information processing method from the perspective of a server, referring to fig. 6, the method specifically may include:
S601: receiving current language information and document information that are collected from a target page and reported by a client, wherein the document information includes document content information and document auxiliary information, and the auxiliary information includes information for uniquely positioning the document in the target page;
S602: judging whether a translation abnormality exists in the received document content according to the current language information;
S603: judging whether the documents at a plurality of different positions in the target page are manageable by counting the document information about the target page in the same language reported by the clients corresponding to a plurality of users;
S604: generating a document management work order according to the documents in the target page that belong to the manageable class and have a translation abnormality, so that the translation abnormality can be managed;
S605: storing, in a word stock created for the target page, the content information and auxiliary information of the documents that belong to the manageable class and have no translation abnormality in the corresponding language, and/or the auxiliary information of the documents that belong to the non-manageable class, wherein the word stock is provided to the client so that, when collecting documents of the target page, the client only reports document information that does not hit the word stock.
In a specific implementation, it can be judged whether the document contents viewed by different users at the same position of the target page are the same, and whether the document at that position is manageable can be determined according to the result.
The document information reported by the client side can also include position coordinate information of the document in the target page. After document information with a translation abnormality is identified, frame selection mark information for that document can be added to the target page according to the position coordinate information, so that the frame selection mark information is provided together with the document management work order.
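A work order carrying a frame selection mark might be assembled as in the following sketch. It assumes the reported position coordinate information is a rectangle `(x, y, width, height)` in page pixels and that the mark is the same rectangle enlarged by a small padding; the names `BoxMark`, `box_from_rect`, and `make_work_order`, and the work order fields, are hypothetical.

```python
from dataclasses import dataclass, asdict


@dataclass
class BoxMark:
    left: float
    top: float
    width: float
    height: float


def box_from_rect(x, y, width, height, padding=4.0) -> BoxMark:
    """Expand the reported rectangle by a padding, clamped at the page origin."""
    left = max(0.0, x - padding)
    top = max(0.0, y - padding)
    return BoxMark(left, top, width + (x - left) + padding, height + (y - top) + padding)


def make_work_order(page, language, path, content, rect) -> dict:
    """Bundle the abnormal document with its frame selection mark."""
    return {
        "page": page,
        "language": language,
        "path": path,
        "content": content,
        "box_mark": asdict(box_from_rect(*rect)),
    }
```

The mark lets whoever handles the work order locate the abnormal document visually instead of resolving the DOM path by hand.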
In addition, in the concrete implementation, a document collection strategy of the target page can be determined according to the user access quantity, the code iteration frequency and/or the user operation complexity of the target page, so that the client side can execute document collection and reporting logic according to the document collection strategy; wherein the document collection policy includes a sampling rate of document collection.
Specifically, the document collection policy of the target page can be determined according to the user access amount, the code iteration frequency and/or the user operation complexity of the target page corresponding to different stages of the life cycle.
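One way such a collection policy could be computed is sketched below. The formula, the weights, and the parameter names (`daily_visits`, `releases_per_month`, `interaction_paths`, `target_reports_per_day`) are purely illustrative assumptions: the specification only states that the sampling rate depends on user access volume, code iteration frequency, and/or user operation complexity.

```python
import random


def sampling_rate(daily_visits, releases_per_month, interaction_paths,
                  target_reports_per_day=5000):
    """Return a sampling rate in [0, 1] for document collection on one page.

    High-traffic pages are sampled more sparsely; frequent code iteration and
    complex user operations raise the rate so new documents are still covered.
    """
    base = min(1.0, target_reports_per_day / max(daily_visits, 1))
    churn_boost = 1.0 + 0.1 * releases_per_month
    complexity_boost = 1.0 + 0.05 * interaction_paths
    return min(1.0, base * churn_boost * complexity_boost)


def should_collect(rate, rng=random.random):
    """Decide per page visit whether the collection script runs."""
    return rng() < rate
```

Recomputing the rate at each life cycle stage (for example right after launch versus a long stable period) would let a newly released page be sampled heavily and a stable one lightly.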
For the details of the second and third embodiments, which are not described in detail, reference may be made to the description of the first embodiment and other portions of the present specification, and the details are not repeated here.
It should be noted that the embodiment of the present application may involve the use of user data. In practical application, user-specific personal data may be used in the solution described herein within the scope permitted by the applicable laws and regulations of the country concerned and in compliance with their requirements (for example, with the user's explicit consent and after the user has been effectively notified).
Corresponding to the second embodiment, the embodiment of the application also provides a page document information processing device. Referring to fig. 7, the device may include:
a word stock determining unit 701, configured to determine word stock information associated with a target page, where the word stock is used to store information of a plurality of entries, and the information of the plurality of entries includes: the content information and auxiliary information of documents in the target page that belong to the manageable class and have no translation abnormality in a target language, and/or the auxiliary information of documents that belong to the non-manageable class, where the auxiliary information includes information for uniquely positioning the document in the target page;
the acquisition unit 702 is configured to acquire current language information of a target page and document information in the page in a process of accessing the target page by a user; wherein, the text information that gathers includes: the content information of the document and the auxiliary information of the document;
and a comparison and judgment unit 703, configured to compare the collected document information with the entry information in the word stock, and report the document information that does not hit the word stock, together with the current language information, to a server, so that the server can judge whether the received document information has a translation abnormality according to the current language information.
In particular, the apparatus may further include:
and the full-volume reporting unit is used for reporting the full volume of the document information acquired by the client to the server when the word stock associated with the target page is not created or is empty.
The word stock is generated and updated by the server side in the following way: and counting the document information reported by the clients corresponding to a plurality of users and related to the target page in the same language, judging whether the documents at a plurality of different positions in the target page are manageable, and storing the document content information and the auxiliary information thereof which belong to the manageable document and have no translation abnormality in the corresponding language and/or the auxiliary information thereof which belong to the non-manageable document into a word stock created for the target page.
In particular, the acquisition unit may be specifically configured to:
collecting current language information of the target page and text information in the page under the condition that the client starts to access the target page and interaction is not generated yet;
and monitoring the change condition of the Document Object Model (DOM) associated with the target page, and sensing the interactive behavior of the user so as to acquire the document information on the changed node when the interactive behavior of the user is sensed.
In order to avoid the influence of the script execution process on the normal page rendering interaction, the device may further include:
and the task slicing processing unit is used for slicing the task into a plurality of subtasks if page interaction occurs while the client is executing the document information collection and reporting task and triggers the browser to refresh the page at a preset frame rate, so that the subtasks are inserted, one subtask at a time and in a scattered manner, into the idle time between the page drawing tasks of adjacent frames and executed there.
The auxiliary information of the document comprises path information of page elements corresponding to the document in a target page DOM;
the apparatus may further include:
the path change condition monitoring unit is used for monitoring the path information change condition of the page elements caused by the page interaction behavior so as to keep the uniqueness of the path information acquired by the same page elements.
In addition, the auxiliary information of the document further comprises position coordinate information of the document in the target page, so that the server adds a corresponding frame selection mark in the target page according to the position coordinate information after recognizing the document information with abnormal translation.
At this time, the apparatus may further include:
and the page jitter judging unit is used for determining the element jitter ending time in the target page by monitoring the performance of the target page before the current language information of the target page and the text information in the page are acquired, so that the acquisition of the current language information of the target page and the text information in the page is triggered after the element jitter is ended.
Furthermore, the apparatus may further include:
and the hidden element judging unit is used for judging whether the acquired document information is the document corresponding to the hidden element, and if so, the reporting of the document information is canceled.
Wherein, the comparison judging unit may specifically be configured to:
judging whether a target entry corresponding to the attached information of the currently acquired document exists in the word stock;
if the target entry exists and the document category corresponding to the target entry is a manageable category, and the content of the currently acquired document is consistent with the content of the document in the target entry, determining the acquired document as a document which does not need to be reported; or if the target entry exists and the document class information corresponding to the target entry is an unmanageable class, determining the acquired document as a document without reporting;
and reporting the remaining documents, other than the documents that do not need to be reported, to the server as the document information that does not hit the word stock.
In addition, the apparatus may further include:
the language identification unit is used for carrying out language identification on the acquired document content if the document category corresponding to the target entry is a manageable category and the content of the currently acquired document is inconsistent with the content of the document in the target entry;
and the marking unit is used for adding a translation abnormality identification to the acquired document content and reporting the translation abnormality identification to the server if the acquired document content is inconsistent with the current language of the target page.
Corresponding to the third embodiment, the embodiment of the application also provides a page document information processing device. Referring to fig. 8, the device may include:
a document information receiving unit 801, configured to receive current language information and document information collected about a target page and reported by a client, where the document information includes document content information and auxiliary information of a document, and the auxiliary information includes information for uniquely positioning the document in the target page;
a translation abnormality judging unit 802, configured to judge whether a translation abnormality exists in the received document content according to the current language information;
A document category identifying unit 803, configured to determine whether documents at a plurality of different positions in the target page are manageable by counting document information about the target page in the same language, which is reported by clients corresponding to a plurality of users;
a work order generating unit 804, configured to generate a work order for managing the translation abnormality according to the documents belonging to the manageable class of documents and having the translation abnormality in the target page;
the word stock providing unit 805 is configured to store, in a word stock created for the target page, the content information and auxiliary information of the documents that belong to the manageable class and have no translation abnormality in the corresponding language, and/or the auxiliary information of the documents that belong to the non-manageable class, where the word stock is provided to the client so that, when collecting documents of the target page, the client only reports document information that does not hit the word stock.
Wherein, the document category identification unit may specifically be configured to:
judging whether the document contents viewed by different users at the same position of the target page are the same, and determining whether the document at the corresponding position is manageable according to the judging result.
In a specific implementation, the document information reported by the client side also includes position coordinate information of the document in the target page; at this time, the apparatus may further include:
and the frame selection marking unit is used for adding frame selection marking information of the text information in the target page according to the position coordinate information after judging that the text information with abnormal translation exists, so that the frame selection marking information is also provided when the text management work order is provided.
In addition, the apparatus may further include:
the strategy control unit is used for controlling a document collection strategy of the target page according to the user access quantity, the code iteration frequency and/or the user operation complexity of the target page so that the client side executes the collection and reporting logic of the document according to the document collection strategy; wherein the document collection policy includes a sampling rate of document collection.
Specifically, the policy control unit may specifically be configured to:
and controlling a document collection strategy of the target page according to the user access quantity, the code iteration frequency and/or the user operation complexity of the target page, which are respectively corresponding to different stages of the life cycle.
In addition, the embodiment of the application also provides a computer readable storage medium, on which a computer program is stored, which when being executed by a processor, implements the steps of the method of any one of the previous method embodiments.
And an electronic device comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read for execution by the one or more processors, perform the steps of the method of any of the preceding method embodiments.
Fig. 9 illustrates an exemplary architecture of an electronic device. For example, device 900 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, an aircraft, and so forth.
Referring to fig. 9, device 900 may include one or more of the following components: a processing component 902, a memory 904, a power component 906, a multimedia component 908, an audio component 910, an input/output (I/O) interface 912, a sensor component 914, and a communication component 916.
The processing component 902 generally controls overall operation of the device 900, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 902 may include one or more processors 920 to execute instructions to perform all or part of the steps of the methods provided by the disclosed subject matter. Further, the processing component 902 can include one or more modules that facilitate interaction between the processing component 902 and other components. For example, the processing component 902 can include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.
The memory 904 is configured to store various types of data to support operations at the device 900. Examples of such data include instructions for any application or method operating on device 900, contact data, phonebook data, messages, pictures, videos, and the like. The memory 904 may be implemented by any type of volatile or nonvolatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply component 906 provides power to the various components of the device 900. Power supply components 906 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for device 900.
The multimedia component 908 comprises a screen between the device 900 and the user that provides an output interface. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or sliding action, but also the duration and pressure associated with the touch or sliding operation. In some embodiments, the multimedia component 908 includes a front-facing camera and/or a rear-facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 900 is in an operational mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 910 is configured to output and/or input audio signals. For example, the audio component 910 includes a Microphone (MIC) configured to receive external audio signals when the device 900 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 904 or transmitted via the communication component 916. In some embodiments, the audio component 910 further includes a speaker for outputting audio signals.
The I/O interface 912 provides an interface between the processing component 902 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 914 includes one or more sensors for providing status assessment of various aspects of the device 900. For example, the sensor assembly 914 may detect the on/off state of the device 900 and the relative positioning of components, such as the display and keypad of the device 900. The sensor assembly 914 may also detect a change in position of the device 900 or of one component of the device 900, the presence or absence of user contact with the device 900, the orientation or acceleration/deceleration of the device 900, and a change in temperature of the device 900. The sensor assembly 914 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 914 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 914 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 916 is configured to facilitate wired or wireless communication between the device 900 and other devices. The device 900 may access a wireless network based on a communication standard, such as WiFi, or a mobile communication network such as 2G, 3G, 4G/LTE, or 5G. In one exemplary embodiment, the communication component 916 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 916 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra-Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the device 900 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a non-transitory computer-readable storage medium is also provided, such as a memory 904 including instructions executable by the processor 920 of the device 900 to perform the methods provided by the disclosed subject matter. For example, the non-transitory computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, etc.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product. The software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments, or in some parts of the embodiments, of the present application.
In this specification, the embodiments are described in a progressive manner; identical or similar parts of the embodiments may be understood by reference to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiments are substantially similar to the method embodiments, they are described relatively briefly, and the relevant parts may be found in the description of the method embodiments. The system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement the present application without creative effort.
The page document information processing method and the electronic device provided by the present application have been described in detail above. Specific examples are used herein to explain the principle and implementation of the application, and the description of the above embodiments is intended only to help in understanding the method and core idea of the application. Meanwhile, those of ordinary skill in the art may, following the idea of the present application, make changes to the specific implementations and the scope of application. In view of the foregoing, the content of this specification should not be construed as limiting the application.

Claims (14)

1. A page document information processing system, comprising:
a client, configured to, while a user accesses a target page, collect current language information of the target page and document information in the page by executing a page script injected into the target page, and to report the collected document information to a server, wherein the collected document information comprises: document content information and auxiliary information of the document, the auxiliary information comprising information for uniquely locating the document in the target page;
the server, configured to judge, according to the current language information, whether the received document content has a translation abnormality; to judge, by aggregating the document information reported by the clients of a plurality of users for the target page in the same language, whether the documents at a plurality of different positions in the target page are manageable; and to store, in a word stock created for the target page, the content information and auxiliary information of documents that belong to the manageable class and have no translation abnormality in the corresponding language, and/or the auxiliary information of documents that belong to the non-manageable class;
wherein the word stock is synchronized to the client, so that when document collection is performed on the target page by the page script, only document information that does not hit the word stock is reported.
2. A page document information processing method, comprising:
determining word stock information associated with a target page, wherein the word stock is used for storing information of a plurality of entries, and the information of the plurality of entries comprises: the content information and auxiliary information of documents in the target page that belong to the manageable class and have no translation abnormality in a target language, and/or the auxiliary information of documents that belong to the non-manageable class, the auxiliary information comprising information for uniquely locating a document in the target page;
while a user accesses the target page, collecting current language information of the target page and document information in the page, wherein the collected document information comprises: document content information and auxiliary information of the document;
comparing the collected document information with the entry information in the word stock, and reporting the document information that does not hit the word stock, together with the current language information, to a server, so that the server judges, according to the current language information, whether the received document information has a translation abnormality.
3. The method of claim 2, wherein
the word stock is generated and updated by the server in the following manner: by aggregating the document information reported by the clients of a plurality of users for the target page in the same language, judging whether the documents at a plurality of different positions in the target page are manageable, and storing, in a word stock created for the target page, the content information and auxiliary information of documents that belong to the manageable class and have no translation abnormality in the corresponding language, and/or the auxiliary information of documents that belong to the non-manageable class.
4. The method of claim 2, wherein
collecting the current language information of the target page and the document information in the page comprises:
collecting the current language information of the target page and the document information in the page when the client has started to access the target page and no interaction has yet occurred;
monitoring changes to the Document Object Model (DOM) associated with the target page and sensing user interaction behavior, so that when user interaction behavior is sensed, the document information on the changed nodes is collected.
5. The method of claim 2, further comprising:
if a page interaction occurs while the document collection and reporting task is being executed, triggering the browser to refresh the page at a preset frame rate, splitting the task into a plurality of subtasks, so that the task is executed subtask by subtask, distributed across the idle time between the page-drawing tasks of a plurality of adjacent frames.
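As a rough illustration of the splitting described in claim 5, the following Python sketch models a collection task that is cut into fixed-size subtasks and then drained a few subtasks at a time in the idle slots left after each frame's drawing work. The function names, the fixed subtask size, and the representation of idle time as "number of subtasks that fit after this frame" are all assumptions made for the example; the claim does not specify them. In a browser the idle slots would come from the rendering loop (for example via an idle callback), which is simulated here by a plain list.

```python
from collections import deque
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_into_subtasks(items: Sequence[T], size: int) -> List[Sequence[T]]:
    """Cut one large collection task into subtasks of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_in_idle_slots(
    subtasks: Sequence[Sequence[T]],
    handle: Callable[[Sequence[T]], None],
    idle_slots: Sequence[int],
) -> Tuple[int, List[Sequence[T]]]:
    """Run subtasks only in the idle time that follows each frame.

    `idle_slots[k]` is a hypothetical budget: how many subtasks fit into
    the idle time after frame k. Returns the number of frames consumed
    and any subtasks still pending when the budgets run out.
    """
    pending = deque(subtasks)
    frames_used = 0
    for slots in idle_slots:
        if not pending:
            break
        frames_used += 1  # this frame was drawn while work was still pending
        for _ in range(slots):
            if not pending:
                break
            handle(pending.popleft())
    return frames_used, list(pending)
```

Because each subtask is small, a frame with no idle time simply postpones the remaining work to a later frame instead of delaying page drawing.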
6. The method of claim 2, wherein
the auxiliary information of the document comprises path information, in the DOM of the target page, of the page element corresponding to the document;
the method further comprises:
monitoring changes in the path information of page elements caused by page interaction behavior, so as to keep the path information collected for the same page element unique.
7. The method of claim 2, wherein
the auxiliary information of the document further comprises position coordinate information of the document in the target page, so that the server, after identifying document information with a translation abnormality, adds a corresponding frame-selection mark to the target page according to the position coordinate information.
8. The method according to any one of claims 2 to 7, wherein
comparing the collected document information with the entry information in the word stock, and reporting the document information that does not hit the word stock, together with the current language information, to the server comprises:
judging whether a target entry corresponding to the auxiliary information of the currently collected document exists in the word stock;
if the target entry exists, the document class corresponding to the target entry is the manageable class, and the content of the currently collected document is consistent with the document content in the target entry, determining the collected document to be a document that does not need to be reported; or, if the target entry exists and the document class corresponding to the target entry is the non-manageable class, determining the collected document to be a document that does not need to be reported;
reporting the remaining documents, other than those that do not need to be reported, to the server as the document information that does not hit the word stock.
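The comparison in claim 8 can be sketched as a small filter. This is a minimal Python sketch under stated assumptions: the word stock is modeled as a dictionary keyed by the auxiliary (locator) information, and the names `Entry`, `Collected`, `needs_report`, and `select_for_report` are hypothetical, not taken from the application.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Entry:
    """One word stock entry (hypothetical shape)."""
    manageable: bool
    content: Optional[str] = None  # stored only for manageable documents


@dataclass(frozen=True)
class Collected:
    """One document collected from the page."""
    locator: str  # auxiliary information, e.g. a DOM path
    content: str


def needs_report(item: Collected, word_stock: Dict[str, Entry]) -> bool:
    """Return True when the collected document does not hit the word stock."""
    entry = word_stock.get(item.locator)
    if entry is None:
        return True   # no target entry for this locator
    if not entry.manageable:
        return False  # non-manageable class: never reported
    # Manageable class: skip only when the content is unchanged.
    return entry.content != item.content


def select_for_report(
    items: List[Collected], word_stock: Dict[str, Entry]
) -> List[Collected]:
    """Keep only the documents that must be reported to the server."""
    return [item for item in items if needs_report(item, word_stock)]
```

With this filter, a known manageable document is reported again only when its content changes, and a position already classified as non-manageable is skipped entirely.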
9. A page document information processing method, comprising:
receiving current language information and document information collected from a target page and reported by a client, wherein the document information comprises document content information and auxiliary information of the document, the auxiliary information comprising information for uniquely locating the document in the target page;
judging, according to the current language information, whether the received document content has a translation abnormality;
judging, by aggregating the document information reported by the clients of a plurality of users for the target page in the same language, whether the documents at a plurality of different positions in the target page are manageable;
generating a document management work order for the documents in the target page that belong to the manageable class and have a translation abnormality, so that the translation abnormality can be handled;
storing, in a word stock created for the target page, the content information and auxiliary information of documents that belong to the manageable class and have no translation abnormality in the corresponding language, and/or the auxiliary information of documents that belong to the non-manageable class, wherein the word stock is provided to the client so that, when collecting documents from the target page, the client reports only the document information that does not hit the word stock.
10. The method of claim 9, wherein
judging, by aggregating the document information reported by the clients of a plurality of users for the target page in the same language, whether the documents at a plurality of different positions in the target page are manageable comprises:
judging whether the document content viewed by different users at the same position in the target page is the same, and determining, according to the judgment result, whether the document at the corresponding position is manageable.
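One way to read the judgment in claim 10 is sketched below in Python. The sketch assumes that a position is treated as manageable when every user who reported it, in the same language, saw identical content, and as non-manageable when the content differs between users (for example a user name or a personalized price). That mapping, the `min_users` threshold, and the tuple layout of a report are assumptions for illustration; the claim only states that the decision follows from whether the content is the same.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# (user_id, language, locator, content); a hypothetical report layout.
Report = Tuple[str, str, str, str]


def classify_positions(
    reports: Iterable[Report], min_users: int = 2
) -> Dict[Tuple[str, str], bool]:
    """Map (language, locator) to True (manageable) or False (non-manageable).

    Positions reported by fewer than `min_users` distinct users are left
    out, since there is not yet enough evidence to classify them.
    """
    contents: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    users: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for user_id, language, locator, content in reports:
        key = (language, locator)
        contents[key].add(content)
        users[key].add(user_id)
    return {
        key: len(contents[key]) == 1  # same content for everyone
        for key in contents
        if len(users[key]) >= min_users
    }
```

The resulting classification is what would decide whether an entry is stored in the word stock with its content (manageable) or with its auxiliary information only (non-manageable).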
11. The method of claim 9, wherein
the document information reported by the client further comprises position coordinate information of the document in the target page;
the method further comprises:
after determining that document information with a translation abnormality exists, adding frame-selection mark information for that document information to the target page according to the position coordinate information, so that the frame-selection mark information is also provided when the document management work order is provided.
12. The method of claim 9, further comprising:
controlling a document collection policy for the target page according to the user access volume, the code iteration frequency, and/or the user operation complexity of the target page, so that the client executes document collection and reporting logic according to the document collection policy; wherein the document collection policy includes a sampling rate for document collection.
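The claim does not say how the sampling rate is derived from the three inputs, so the following Python sketch is purely illustrative. It assumes a hypothetical heuristic in which heavily visited pages need a smaller fraction of sessions to reach a target sample count, while frequent code releases and more complex user operations raise the rate. The formula, the parameter names, and the `target_samples` default are all assumptions. The second function shows one common way a client could apply a rate deterministically per session.

```python
import hashlib


def sampling_rate(
    daily_visits: int,
    releases_per_month: int,
    complexity: float,
    target_samples: int = 1000,
) -> float:
    """Hypothetical heuristic returning a sampling rate in [0.0, 1.0].

    `complexity` is an assumed non-negative score for how many
    interaction paths the page has (0.0 means a static page).
    """
    if daily_visits <= 0:
        return 1.0  # no traffic data: collect from every session
    base = target_samples / daily_visits
    boost = (1 + releases_per_month) * (1 + complexity)
    return max(0.0, min(1.0, base * boost))


def should_collect(session_id: str, rate: float) -> bool:
    """Decide deterministically whether this session collects documents."""
    if rate >= 1.0:
        return True
    if rate <= 0.0:
        return False
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate
```

Hashing the session identifier keeps the decision stable for the whole visit, so a sampled session reports consistently instead of flipping between page interactions.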
13. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method of any one of claims 2 to 12.
14. An electronic device, comprising:
one or more processors; and
a memory associated with the one or more processors, the memory storing program instructions that, when read and executed by the one or more processors, perform the steps of the method of any one of claims 2 to 12.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310459684.0A CN116755807A (en) 2023-04-24 2023-04-24 Page document information processing method and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310459684.0A CN116755807A (en) 2023-04-24 2023-04-24 Page document information processing method and electronic equipment

Publications (1)

Publication Number Publication Date
CN116755807A true CN116755807A (en) 2023-09-15

Family

ID=87952142

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310459684.0A Pending CN116755807A (en) 2023-04-24 2023-04-24 Page document information processing method and electronic equipment

Country Status (1)

Country Link
CN (1) CN116755807A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240221

Address after: Room 303, 3rd Floor, Building 5, No. 699 Wangshang Road, Changhe Street, Binjiang District, Hangzhou City, Zhejiang Province, 310052
Applicant after: Hangzhou Alibaba Overseas Internet Industry Co.,Ltd.
Country or region after: China

Address before: Room 554, 5 / F, building 3, 969 Wenyi West Road, Wuchang Street, Yuhang District, Hangzhou City, Zhejiang Province
Applicant before: Alibaba (China) Co.,Ltd.
Country or region before: China