3.1. Cross-Lingual Document Embedding
We propose to address the problem of cross-lingual document embedding as a classification problem that focuses on the use of class labels and comparable data. Our goal is to find mappings that project the document distributions of different languages into the same distribution, that is, into a shared semantic space. In this framework, a classifier-based method is applied to find the mappings. In the following subsections, we first introduce the definition of cross-lingual document learning and then show how to obtain language-independent document features in a multilabel classification manner.
Suppose $L$ represents the collection of different languages; the $j$-th document of language $l \in L$ is represented as $x_j^{(l)}$, and the set of all documents of language $l$ is represented as $X^{(l)}$. Along with these comes the class label set $Y^{(l)}$, where each $y_j^{(l)} \in \{0, 1\}^C$ is a label vector and $C$ is the number of classes. Thus, for each class $c$, $y_{j,c}^{(l)} = 1$ if the document belongs to the $c$-th class, while $y_{j,c}^{(l)} = 0$ if not. For example, with $C = 4$, a document belonging to the first and third classes has the label vector $(1, 0, 1, 0)$.
Relevance scores cannot be calculated directly for documents $x^{(i)}$ and $x^{(j)}$ from different languages $i, j \in L$, because they come from different spaces and have different distributions. The main goal is to map the document sets $X^{(1)}, \dots, X^{(|L|)}$ into the same space so that they can be compared with each other. Thus, multilingual document learning is defined as finding, for each language $l$, the mapping that maps $X^{(l)}$ to the shared semantic space. The transformation function that provides the mapping relation is expressed as $f^{(l)}(\,\cdot\,; \theta^{(l)}): X^{(l)} \rightarrow \mathbb{R}^{d}$, where $d$ is the shared space dimension and $\theta^{(l)}$ indicates the parameters that need to be learned. For simplicity and clarity of discussion, we refer to $f^{(l)}$ as the encoder in the following. Different sources of $X$ require language-specific encoders for the same $Y$.
Figure 1 shows the distributions of documents in four languages: English, Italian, Danish, and Vietnamese. Each language has a different distribution. The shared semantic space is constructed from the same supervision signals, and each document distribution $X^{(l)}$ can be mapped into it through a language-specific encoder. Once multilingual documents are mapped as embeddings in the same semantic space, the semantic distances between them can be calculated. Thus, the goal of cross-lingual document embedding is to find an appropriate encoder for each language.
3.2. Constructing the Shared Semantic Space
We construct the shared semantic space through the training process of the classification problem. The goal is to find the mappings, which can be provided by a linear classifier. Generally, for an input vector $x \in \mathbb{R}^{m}$ and a predicted label vector $\hat{y}$, the classification process is to find the transformation given by $W \in \mathbb{R}^{C \times m}$ and $b \in \mathbb{R}^{C}$ that makes the label prediction as accurate as possible under the "winner-takes-all" decision rule, as Equation (1) shows:

$$\hat{y} = \arg\max\,(Wx + b). \quad (1)$$
The semantic space comes from the decomposition of the transformation matrix $W$. It is easy to observe that $W$ can be decomposed into the product of two matrices, $W = HV$, with $V \in \mathbb{R}^{r \times m}$ and $H \in \mathbb{R}^{C \times r}$. Substituting this into Equation (1) gives the following:

$$\hat{y} = \arg\max\,(HVx + b), \quad (2)$$

which can be regarded as first transforming $x$ into an $r$-dimensional space through the matrix $V$, and then completing the classification task through the matrix $H$, which represents the linear relationship between data features and labels. Assuming that the data features $Vx$ and the label $y$ are given, $H$ can be gradually optimized through such supervised training to improve the prediction accuracy. Similarly, assuming that the matrix $H$ and the label $y$ are fixed, the prediction accuracy can also be improved in an iterative manner by improving the data features $Vx$. In other words, the $r$-dimensional space is supervised by the category labels when $H$ is fixed, and this space is the so-called semantic embedding space. The reason the $r$-dimensional space can be used as the embedding space of the document is that this linear classification rule guides points of the same category to be close to each other in the embedding space and points of different categories to be far away from each other. At the same time, in order to make the shared space more discriminative, $H$ can be constrained to be an orthogonal matrix, which enforces orthogonality between the different categories in the shared space and makes the data more discriminative.
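As a minimal numerical illustration of this decomposition (a sketch with arbitrary dimensions, not the paper's implementation), the following NumPy snippet confirms that factoring $W$ into $H$ and $V$ leaves the winner-takes-all scores unchanged while exposing an $r$-dimensional intermediate representation:

```python
import numpy as np

rng = np.random.default_rng(0)
m, r, C = 300, 128, 100   # input dim m, embedding dim r, class count C (arbitrary)

x = rng.normal(size=m)           # one document's feature vector
V = rng.normal(size=(r, m))      # maps input features into the r-dim space
H = rng.normal(size=(C, r))      # linear classifier on top of the embedding
b = rng.normal(size=C)

W = H @ V                        # the full classifier weight matrix, W = HV
e = V @ x                        # the r-dimensional embedding of the document

scores_full = W @ x + b          # Equation (1): scores from the full matrix
scores_factored = H @ e + b      # Equation (2): scores via the embedding
assert np.allclose(scores_full, scores_factored)

predicted_class = np.argmax(scores_factored)   # winner-takes-all decision
```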
The matrix $V$ maps $x$ to the $r$-dimensional space and converts it to $e = Vx$. When the matrix $V$ is regarded as an encoder, this means that the encoder can map $x$ to the semantic space to get $e$. To sum up, suppose an encoder $f(x; \theta)$, where $\theta$ is a learnable parameter, and an orthogonal matrix $H$; then the label prediction of Equation (2) becomes

$$\hat{y} = \arg\max\left(H f(x; \theta)\right), \quad (3)$$

and the objective function is as follows:

$$\min_{\theta, H} \sum_{j} \ell\left(H f(x_j; \theta),\, y_j\right) \quad \text{s.t.} \quad H H^{\top} = I, \quad (4)$$

where $\ell(\cdot, \cdot)$ is a classification loss.
The encoder projects the document into an $r$-dimensional embedding representation. Equation (3) shows that the predicted label can be obtained by multiplying the $r$-dimensional vector by $H$; in other words, this $r$-dimensional space is linearly related to the label space. If the same labels are used as supervision signals, then texts in different languages can be mapped into the same space, and the correlation of multilingual texts can thereby be calculated for the retrieval task.
3.3. Deep Multilabel Multilingual Document Learning
Note that the projection function is influenced by the input data and the supervisory signals; in particular, the class labels are critical to the projection quality. It is practical to transform the projection problem into a single-label classification problem, where the document-level mapping is achieved through a many-to-one category relationship. However, the category labels of such methods are usually one-hot representations with only one active dimension, and the labels are mutually orthogonal. Such label representations hardly exhibit any interpretability during the training stage. Moreover, a phrase could also be regarded as a class label, where ambiguity is inevitable. As a result, two documents with the same label may well come from different domains and overlap in only a few topics. In reality, the content of a document is often complex, and it is difficult to fully represent a document with only one label.
Therefore, in this work we use multilabels as supervision signals to construct the semantic space. On the one hand, multilabels can cover more information than a single label, thereby reducing ambiguity; on the other hand, they enhance the representational ability of the semantic space. However, few multilabel multilingual corpora are directly available, while many multilingual corpora, such as Wikipedia, have the potential to become multilabel.
To generate labels automatically, a quick way is to use Wikipedia concepts directly, but the number of concepts runs into the millions. Linear methods can take advantage of all of them as classification labels; however, millions of tags are not suitable as the output layer of a deep neural network, and considering document connections in a multilabel manner also becomes difficult. Furthermore, because such classification labels are orthogonal to each other, it is difficult to account for the natural connections between documents from the same language. Another route is to process the title and use the resulting stem sequence as multilabels. This method is straightforward and efficient and has been used in many studies [8,29], but it still yields tens of thousands of labels. Moreover, titles are often concise and relatively general, which leads to unevenly distributed data when they are used as category labels. Alternatively, adding multilabels manually is feasible; however, this approach is not only time-consuming and expensive but also difficult to generalize to corpora in other languages and domains.
Therefore, an automatic method must be employed to obtain multiple labels. The latent Dirichlet allocation (LDA) algorithm is a generative probabilistic model of a corpus and an unsupervised method for obtaining the topic distributions of documents, and it is widely used in natural language processing research. Thus, we choose the LDA method to obtain supervision signals automatically, under the assumption that the topic distributions obtained by LDA are sufficiently accurate. There are several advantages to doing so. First, multilabels can be automatically extracted from the data itself without additional information. Second, the inherent connections between documents in the same language can be involved. Third, the method easily generalizes to other languages. Fourth, the ambiguous interference of manual labels is excluded. Fifth, the number of categories is controllable, and this flexibility makes deep neural network methods applicable. Moreover, according to the topic distributions returned by LDA, both the number of categories and the number of multilabels can be adjusted. This also brings interpretability, as each dimension corresponds to a topic described by some vocabulary. For instance, if the number of topics is set to 100 and the number of multilabels is set to 1, then the topic distribution returned by LDA is a 100-dimensional vector whose entries sum to 1. The label is likewise a 100-dimensional vector in one-hot form, where the position of the topic with the highest probability is set to 1 and the remaining 99 positions are set to 0. If the number of multilabels is set to 6, the label instead sets the 6 topics with the highest probabilities to 1 and the remaining 94 topics to 0.
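As a sketch of this labeling step, the snippet below uses gensim's LdaModel (our choice of implementation; the paper does not name one) to turn each document's topic distribution into a k-hot multilabel vector, with 100 topics and k = 6 as in the example above:

```python
import numpy as np
from gensim import corpora
from gensim.models import LdaModel

def topic_multilabels(tokenized_docs, num_topics=100, k=6):
    """Train LDA on the reference language and convert each document's topic
    distribution into a k-hot multilabel vector over the topics."""
    dictionary = corpora.Dictionary(tokenized_docs)
    bows = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = LdaModel(bows, num_topics=num_topics, id2word=dictionary)

    labels = np.zeros((len(bows), num_topics))
    for i, bow in enumerate(bows):
        # (topic_id, probability) pairs; minimum_probability=0 keeps every topic
        dist = lda.get_document_topics(bow, minimum_probability=0.0)
        for topic_id, _ in sorted(dist, key=lambda t: -t[1])[:k]:
            labels[i, topic_id] = 1.0    # the k most probable topics become 1
    return labels, lda
```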
Learning to map features back to the input data helps to improve the quality of the features used in classifier training, which in turn improves the discriminative ability of the shared semantic space. Lei's work also showed that adding an unsupervised feature learned by an auto-encoder could improve the performance of a linear classifier [30]. We follow this setting and use a supervised auto-encoder model. Denoting an encoder $f(x; \theta)$ with output $e = f(x; \theta)$, a decoder $g(e; \phi)$, and an orthogonal matrix $H$, the objective function is as follows:

$$\min_{\theta, \phi, H} \sum_{j} \left\| x_j - g\left(f(x_j; \theta); \phi\right) \right\|^2 + \lambda\, \ell\left(H f(x_j; \theta),\, y_j\right) \quad \text{s.t.} \quad H H^{\top} = I, \quad (5)$$

where $\lambda$ is the trade-off parameter between the reconstruction error and the supervision loss. Each language is trained separately, and the gradient descent algorithm is used to iteratively search for the optimal parameters.
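Below is a minimal sketch of this objective, reusing the encoder and $H$ from the previous snippet and adding an assumed mirror-image decoder for $g$; the mean-squared reconstruction error and the soft orthogonality penalty are again our assumptions:

```python
decoder = nn.Sequential(                     # g(e; phi): assumed mirror of f
    nn.Linear(r, 256), nn.ReLU(), nn.Linear(256, m)
)

def mdl_loss(x, y, lam=0.5, mu=1.0):
    """Equation (5): reconstruction error plus lambda times the supervision
    loss, with a soft penalty keeping the rows of H orthonormal."""
    e = encoder(x)
    recon = F.mse_loss(decoder(e), x)                       # ||x - g(f(x))||^2
    sup = F.binary_cross_entropy_with_logits(e @ H.T, y)    # l(H f(x), y)
    ortho = ((H @ H.T - torch.eye(C)) ** 2).sum()
    return recon + lam * sup + mu * ortho

# Per the paper, each language trains its own encoder/decoder against the
# shared supervision signals, e.g. with torch.optim.Adam on this loss.
```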
3.4. Implementation
The framework of MDL is summarized in Figure 2 and includes two main processes: automatic labeling and MDL model training. First, concept ids and the corresponding documents are extracted from a language-specific Wikipedia dump. The extracted document collection can be seen as a comparable corpus: each document corresponds to exactly one concept id, but each concept id corresponds to multiple documents from multiple languages. The comparable corpus is divided into a training set and a test set. In the first process, labeling, a specified language is selected as the reference and used to construct the shared space. Topic distributions are obtained automatically through the LDA algorithm. The topic distributions of the training set are transformed into multilabels, which serve as supervision signals for the MDL model; the same supervision signals are shared across languages, transferred via concept ids. The topic distributions of the test set are used to compute cosine scores for document ranking, from which the retrieval results are obtained. The retrieval results are recorded by concept ids for transfer to other languages.
The second process is the training of the MDL model, where the Doc2Vec method is used to produce the document representations ($X$) that serve as model input. $X$ is transformed into the shared space ($f(X)$) by an encoder, and these features are then used to predict labels ($Y'$) via the orthogonal matrix ($H$), which acts as a linear classifier. The document features ($f(X)$) are iteratively optimized in the shared space through backpropagation of the supervisory signals. The decoder ($g$) helps maintain the semantic consistency of the original language and improves the discriminability of the features. Each language is trained individually to map its documents into the semantic space. Thus, MDL reduces the amount of data handled during each training run, and parallel training across languages reduces the time cost; in this sense, the proposed method reduces both time and computational complexity.
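To make the retrieval step concrete, the sketch below (with hypothetical per-language encoders trained as above, and gensim's Doc2Vec supplying the input representations) ranks candidate documents of one language against a query document of another by cosine similarity in the shared space:

```python
import numpy as np

def cosine_rank(query_emb, candidate_embs):
    """Rank candidates by cosine similarity to the query in the shared space."""
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    scores = c @ q
    return np.argsort(-scores), scores

# Hypothetical usage:
#   x_en = doc2vec_en.infer_vector(tokens_en)   # Doc2Vec input features
#   q = encoder_en(x_en)                        # English query embedding
#   cands = encoder_it(X_it)                    # Italian candidate embeddings
#   order, scores = cosine_rank(q, cands)       # ranking, scored by concept id
```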