Article

AMTCN: An Attention-Based Multivariate Temporal Convolutional Network for Electricity Consumption Prediction

1 School of Electronic and Electrical Engineering, Wuhan Textile University, Wuhan 430200, China
2 Hubei Key Laboratory of Digital Textile Equipment, Wuhan Textile University, Wuhan 430200, China
3 School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, China
4 State Key Laboratory of Powder Metallurgy, School of Physics and Electronics, Central South University, Changsha 410083, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(20), 4080; https://doi.org/10.3390/electronics13204080
Submission received: 22 September 2024 / Revised: 9 October 2024 / Accepted: 12 October 2024 / Published: 17 October 2024
(This article belongs to the Section Artificial Intelligence)

Abstract

Accurate prediction of electricity consumption is crucial for energy management and allocation. This study introduces a novel approach, named Attention-based Multivariate Temporal Convolutional Network (AMTCN), for electricity consumption forecasting by integrating attention mechanisms with multivariate temporal convolutional networks. The method extracts features from the time series of each feature variable using dilated convolutional networks. Attention mechanisms are then employed to capture the correlations and contextually important information among the various features, thereby enhancing the model’s predictive accuracy. The AMTCN method is general, making it applicable to prediction tasks in a variety of scenarios. Experimental evaluations are conducted on four distinct datasets covering electricity consumption and weather temperature. Comparative experiments with LSTM, ConvLSTM, GRU, and TCN, all widely used deep learning methods, demonstrate that our AMTCN model achieves improvements of up to 57% in MSE, 37% in MAE, 35% in RRSE, and 12% in CORR. This research contributes a promising approach to accurate electricity consumption prediction, leveraging the synergy of attention mechanisms and multivariate temporal convolutional networks, with broad applicability in diverse forecasting scenarios.

1. Introduction

Electricity is a fundamental component of modern life and production. Currently, the majority of electricity in the world is generated from non-renewable sources such as coal, oil, and natural gas. However, with the development of human society, non-renewable energy sources are becoming increasingly scarce, and their usage also leads to serious environmental pollution. Utilizing clean and renewable energy is one of the important pathways to address these issues. In modern life, renewable energy sources utilized by humans mainly include wind energy, solar energy, hydropower, ocean energy, and biomass energy [1], which are characterized by fluctuations in spatial and time evolution. Therefore, accurate prediction of electricity consumption can not only help government agencies tackle environmental pollution but also assist energy suppliers in rational distribution and planning of renewable and non-renewable energy production and supply. This will enable a balanced relationship between energy production and consumption, reducing energy waste and production costs.
In recent years, scholars worldwide have proposed numerous forecasting methods for the analysis and prediction of electricity consumption. These methods can be mainly categorized into three types: time series analysis, machine learning, and deep learning. Commonly used time series analysis methods include ARIMA [2,3] and exponential smoothing [4]. Such models primarily employ linear methods and may have limited fitting capabilities when dealing with complex nonlinear relationships. On the other hand, machine learning has demonstrated better predictive performance when faced with more intricate nonlinear relationships. The traditional machine learning methods encompass random forests [5], XGBoost [6], and support vector machines [7,8], among others.
With the development of big data technology, deep learning has shown remarkable performance in the field of electricity consumption analysis and prediction. Compared to traditional time series analysis methods and machine learning, deep learning has stronger expressive power and adaptability. Ref. [9] used deep-stacked unidirectional LSTM and bidirectional LSTM networks to predict electricity load consumption. Ref. [10] proposed a hybrid prediction model based on improved complete ensemble empirical mode decomposition (ICEEMDAN) and a gated recurrent unit (GRU) for predicting parking demand. Ref. [11] introduced a novel sequence-to-sequence model using the Attention-based Gated Recurrent Unit for predicting wind power. Ref. [12] presented a bidirectional gated recurrent unit model suitable for multi-load consumption evaluation, integrating the input of the evaluation model with temporal convolutional networks to achieve a larger receptive field. In recent years, driven by the surge of the Transformer model [13], many studies related to electricity prediction have started exploring and applying the Transformer model and its variants. Ref. [14] proposed combining the Transformer architecture with a wavelet for predicting wind energy and wind speed for the next 6 h. Ref. [13] improved the existing Transformer model using the Time2vec method to embed the month sequence more effectively and applied it to total electricity consumption forecasting, achieving higher accuracy than Informer [15] and Autoformer [16]. Ref. [17] extracted spatial dependencies using a graph neural architecture and learned temporal correlations using different update functions, one of which is the Transformer architecture. Furthermore, they proposed a modified Transformer architecture based on the fast Fourier transform (FFT) and demonstrated its competitive performance in wind forecasting. The self-attention mechanism and sequence modeling capability of Transformers make them powerful tools for processing time series data. Inspired by Transformers, this research employs multi-head self-attention mechanisms to capture the correlations and important information among different features in multivariate time series, thereby improving the modeling and predictive capabilities of time series data.
For specific prediction tasks, such as electricity consumption or other domains, the data often involve large volumes and varying dimensions. Such problems are commonly divided into univariate and multivariate cases, a distinction originally introduced in the context of ARIMA models [18]. The electricity consumption prediction and other forecasting tasks in this study fall into the category of multivariate time series forecasting. Traditional time series analysis and machine learning approaches struggle to effectively handle nonlinear relationships and complex feature representations, making them less efficient for multivariate time series problems compared to deep learning [19]. Common deep network architectures for handling multivariate time series include Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs) and Transformers. However, traditional CNNs fail to capture long-term dependencies and global information from sequences, while RNNs suffer from issues such as vanishing and exploding gradients and a lack of parallel computation. To address these problems, inspired by attention mechanisms, we propose a novel hybrid model, called AMTCN, which combines attention mechanisms with temporal convolutional networks. The code is available at this repository: https://github.com/man3LLL/AMTCN (accessed date: 11 November 2023). AMTCN incorporates dilated convolutions to effectively address the inability of CNNs to capture long-term dependencies and global information, without encountering vanishing or exploding gradients. Furthermore, the attention mechanism enhances the model’s ability to extract essential information from the different features in multivariate time series data. The main contributions of this work can be summarized as follows:
  • We propose a new hybrid model, AMTCN, which consists of two components: residual blocks with dilated convolutions and an attention module. The model is successfully applied in the field of electricity consumption forecasting using multi-step prediction.
  • We conduct comparative experiments between the proposed model and various competitive baseline models.
  • We evaluate the performance of AMTCN on multivariate time series datasets from the electricity and environmental domains (electricity consumption, electricity demand and temperature). The experimental results demonstrate that our AMTCN model exhibits stronger predictive and generalization capabilities on different multivariate time series datasets from diverse domains.
The rest of the paper is organized as follows. Section 2 introduces the baseline models used in this study and related work. Section 3 provides a detailed description of the sequence problem and the individual components of the AMTCN model. In Section 4, we elaborate on the datasets used for experiments, experimental settings, and results analysis. Finally, conclusions are presented in Section 5.

2. Related Work

This section introduces the four baseline models used in this study, namely LSTM, GRU, ConvLSTM, and TCN, together with their limitations, and provides a detailed description of the TCN network model.
LSTM enhances long-term memory by introducing three types of gates: the input gate, forget gate, and output gate, which maintain and update cell states, addressing the vanishing and exploding gradient problems of standard RNN models [20]. However, LSTM also has some drawbacks. For instance, when dealing with long sequences, LSTM’s computational complexity is high, requiring longer training time and more computational resources, and it is difficult to parallelize. Additionally, although LSTM is designed to address long-term dependencies, it may still struggle to capture such relationships in certain cases [21].
Compared to LSTM, GRU has a more simplified structure, merging the input and forget gates into a single update gate through gated recurrent units [22]. However, the reduced number of gating mechanisms in GRU can lead to performance inferior to LSTM in certain tasks, especially those involving complex long-term dependencies.
ConvLSTM is a neural network architecture that combines CNN and LSTM [23]. It extends the idea of FC-LSTM to make convolutional structures in both the input-to-state and state-to-state transitions. By stacking multiple ConvLSTM layers and forming an encoding-forecasting structure, ConvLSTM can better capture spatiotemporal correlations. However, the introduction of convolutional operations in ConvLSTM leads to increased parameter volume, potentially resulting in overfitting. Moreover, ConvLSTM requires more computational resources and time compared to LSTM.
TCN is a novel architecture based on convolutional neural networks. Unlike traditional CNNs, TCN utilizes dilated causal convolutions and residual connections to capture long-term dependencies in time series data. The dilated causal convolutions in TCN are formed by stacking causal convolutions and dilated convolutions. Causal convolutions ensure that the output at time t is only convolved with elements at time t and earlier in the previous layer, avoiding leakage of future information into the past. However, a major disadvantage of this basic design is the need for an extremely deep network or large filters to achieve a long effective history size [24]. To address this issue, TCN uses exponential dilations, achieving a significantly larger receptive field [25]. The structure of dilated causal convolutions is illustrated in Figure 1.
The residual connections in TCN are designed to add the input to the convolution result and adjust the number of channels when needed, utilizing an additional 1 × 1 convolution to ensure dimension matching between input and convolution results. The residual module in TCN is shown in Figure 2. The unique structure of TCN brings advantages in handling time series data, but it also has some drawbacks. For instance, when dealing with long time series tasks, a continuous stacking of convolutional layers is required, leading to a higher number of parameters, making the model more complex and increasing computational costs. As the network layers deepen, TCN becomes more sensitive to hyperparameter selection, necessitating extensive experimentation and tuning.
In our previous work, Ref. [26] proposed an MTCN model for multivariate time series prediction tasks. However, that model has a large number of parameters, and its predictive accuracy is not significantly superior to the baseline comparison models.

3. Methodology

In this section, we first elaborate on the problem of multivariate time series prediction and subsequently provide a detailed description of the fundamental components employed in our AMTCN model, as well as the overall architecture of the model.

3.1. Sequence Problem Statement

Deep learning leverages deep neural networks to capture complex relationships between data and targets. In the field of deep learning, common sequence problems typically arise in time series, natural language processing [27], audio processing [28], and other domains. Multivariate time series prediction is a branch within the domain of time series. We generally describe the problem of multivariate time series prediction as follows: given an input time series signal $X = \{x_1, x_2, \ldots, x_t\}^N$, where $N$ represents the number of input time series, i.e., the feature dimension of the data, and $t$ denotes the length of the sequence, the output sequence is denoted as $Y = \{y_{t+1}, y_{t+2}, \ldots, y_{t+w}\}^H$, where $H$ is the number of output time series or dimensions, and $w$ is the length of the output sequence, also referred to as the output window size. In our study, we focus on multi-step prediction with multiple variables as input and a single variable as output. The prediction process can be expressed as Formula (1)
$\{y_{t+1}, y_{t+2}, \ldots, y_{t+w}\} = f\left(\{x_1, x_2, \ldots, x_t\}^N\right) \quad (1)$
Throughout the entire process, we employed a technique called sliding window, which is widely used in the field of time series. The basic idea is to define a fixed-size window and slide it over the time series data, typically with a fixed sliding step of 1. In each window, the desired subsequence of the time series is extracted.
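As a concrete illustration of this windowing step (not the authors' released code; variable names and shapes are assumptions), the following NumPy sketch builds input windows and multi-step targets from a multivariate series, where window and horizon correspond to t and w in Formula (1):

```python
import numpy as np

def make_windows(series, window=24, horizon=24, target_col=0, step=1):
    """Slide a fixed-size window over a (length, n_features) array.

    Returns X of shape (samples, window, n_features) and
    y of shape (samples, horizon) taken from the target column.
    """
    X, y = [], []
    for start in range(0, len(series) - window - horizon + 1, step):
        X.append(series[start:start + window, :])
        y.append(series[start + window:start + window + horizon, target_col])
    return np.array(X), np.array(y)

# Example: 9 feature variables, hourly data
data = np.random.rand(1000, 9)
X, y = make_windows(data, window=24, horizon=24, target_col=0)
print(X.shape, y.shape)  # (953, 24, 9) (953, 24)
```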

3.2. Model Structure

To the best of our knowledge, currently, the most suitable models for time series forecasting tasks are RNN models, such as LSTM and its variants. Apart from RNN models, other models applied in time series problems include CNN, Transformer, and their variants. However, RNN models are difficult to train in parallel, resulting in slow training speeds. In contrast, CNN and Transformer models offer parallelism advantages over RNN. Considering these factors, for the multivariable time series forecasting task, we propose an AMTCN inspired by the basic modules of TCN and Transformer. Our goal is to combine multi-head attention mechanisms to design an efficient convolutional network that can handle diverse multi-variable time series effectively. The AMTCN model exhibits the following key features:
(1) Adaptability to various scenarios is enabled for both input and output.
(2) Integration of two unique asymmetric residual blocks.
(3) Development of individual sub-models dedicated to each feature in the input data, termed the multi-head model.
(4) Utilization of multi-head attention mechanisms to capture interdependencies among amalgamated sub-models representing diverse feature sequences.
In this work, we emphasize the combination of dilated convolutional residual blocks and multi-head attention mechanisms to construct the multi-head model, enabling it to perform forecasting tasks on electricity consumption and other scenarios under different input–output conditions. The entire architecture of AMTCN is illustrated in Figure 3. Initially, a multi-head structure is utilized to extract information for each feature variable. Subsequently, we combine the information from each feature and apply the attention mechanism to the entire sequence of feature information, enabling the model to learn different focus points and relationships and further enhancing its expressive power and generalization ability. In the following sections, we will explain each part of the AMTCN module in detail.

3.2.1. Dilated Convolution

In WaveNet [29] and TCN, causal convolutions are used, where the current input values only depend on past input values and do not consider future input values. This approach prevents future information from leaking to the past, ensuring causality. In our work, we employ the sliding window and walk-forward validation techniques, ensuring that each data window contains only a segment of past observations. The model uses these windowed past values to predict the output of the next time step, avoiding future leakage. Therefore, we did not use causal convolutions as in TCN; instead, we adopted regular dilated convolutions. The benefit of this approach is that when dealing with more complex sequences, our convolutional approach can obtain a larger receptive field than causal convolutions, significantly improving model efficiency. Additionally, traditional CNN models often apply pooling operations after convolution, reducing computational complexity and extracting only the most salient features from the input feature maps. However, this process may lead to information loss. In contrast, dilated convolutions offer another advantage of not requiring pooling operations. Instead, they expand the receptive field and extract features through layers of dilation parameters, enabling more effective capture of contextual information. The definition of dilated convolutions is given in Equation (2):
$Y[:, t] = \sum_{i=0}^{k-1} f[:, i] \cdot x[:, t - i \cdot d] \quad (2)$
where $Y$ represents the output time series, $t$ denotes the time step, and the colon indicates that all batches are considered; $k$ represents the filter size, $f$ denotes the filter weights, $x$ denotes the input data, and $d$ is the dilation factor. Figure 4 illustrates the dilated convolutions under various dilation factors.
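To make Equation (2) concrete, the following minimal NumPy sketch (an illustrative assumption, not the implementation used in the paper) computes a dilated 1-D convolution for a single filter; in Keras, the equivalent building block is the Conv1D layer with its dilation_rate argument:

```python
import numpy as np

def dilated_conv1d(x, f, d):
    """Dilated 1-D convolution following Equation (2).

    x : (batch, length) input, f : (k,) filter weights, d : dilation factor.
    The output at time t sums f[i] * x[:, t - i * d]; positions that would
    fall before the start of the sequence are skipped (implicit zero padding).
    """
    batch, length = x.shape
    k = len(f)
    y = np.zeros_like(x, dtype=float)
    for t in range(length):
        for i in range(k):
            idx = t - i * d
            if idx >= 0:
                y[:, t] += f[i] * x[:, idx]
    return y

x = np.arange(16, dtype=float).reshape(1, 16)
print(dilated_conv1d(x, f=np.array([1.0, 1.0, 1.0]), d=2))
```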

3.2.2. Residual Block

Residual connection networks address the challenges posed by vanishing and exploding gradients during the training of deep networks. They achieve this by employing shortcut connections [30], which allow the output of earlier network layers to be added to the input of subsequent layers, thereby mitigating the issues associated with increasing network depth. Similarly, in our proposed network, the utilization of residual connections is implemented. The network has two distinctive residual blocks, each encompassing a parallel convolution network comprising dilated convolutions and non-linear activation functions. To reduce the computational burden arising from deeper network structures and varying convolutional kernel sizes, we adopt an asymmetric block structure, which enhances the network’s efficiency [31].
The detailed structure of the two asymmetric residual blocks is illustrated in Figure 5. Initially, the input data are divided into two channels, and each channel undergoes dilated convolutions followed by the ReLU activation function to introduce non-linearity. Finally, the results from both channels are combined to form the overall stack. In the first residual block, we repeat this process three times, and in the second residual block, we repeat it four times. This design results in an asymmetric structure for the two residual blocks. Ultimately, we integrate shortcut connections, utilizing 1 × 1 convolutions to adjust dimensions and form a complete residual network structure.
The mathematical formulations of the internal operations within the residual block are given as Equations (3)–(5):
$f_1 = \mathrm{ReLU}(W_{k_1} * X + b_{k_1}) \quad (3)$
$f_2 = \mathrm{ReLU}(W_{k_2} * X + b_{k_2}) \quad (4)$
$f = f_1 + f_2 \quad (5)$
In the above equations, f 1 represents the output of channel one, f 2 denotes the output of channel two, and f is the combined result of the two channels. The residual connections in the residual connection network are implemented using 1 × 1 convolutions to adjust dimensions. Similar to TCN, the final output is computed as Formula (6):
$\mathrm{out} = x + F(x) \quad (6)$
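A minimal Keras sketch of one such asymmetric residual block is given below; the kernel sizes, dilation rates, and filter counts are illustrative assumptions rather than the exact configuration used in the paper. It mirrors Equations (3)–(6), with two parallel dilated-convolution channels and a 1 × 1 shortcut convolution:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, repeats=3, dilation=2):
    """Two parallel dilated-convolution channels, repeated `repeats` times,
    followed by a 1x1 shortcut connection (Equations (3)-(6)).
    Kernel sizes and dilation rates are illustrative assumptions."""
    shortcut = layers.Conv1D(filters, 1, padding="same")(x)  # match dimensions
    out = x
    for _ in range(repeats):
        f1 = layers.Conv1D(filters, 3, dilation_rate=dilation,
                           padding="same", activation="relu")(out)  # channel one
        f2 = layers.Conv1D(filters, 5, dilation_rate=dilation,
                           padding="same", activation="relu")(out)  # channel two
        out = layers.Add()([f1, f2])                                 # f = f1 + f2
    return layers.Add()([shortcut, out])                             # out = x + F(x)

inputs = tf.keras.Input(shape=(24, 1))           # one univariate input window
block1 = residual_block(inputs, filters=16, repeats=3)
block2 = residual_block(block1, filters=32, repeats=4)
```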

3.2.3. Multi-Head Attention Mechanism

In our network model, to extract the global feature information of subsequences processed by the residual network and capture the contextual dependencies between different feature sequences, we incorporated the multi-head attention mechanism after the convolutional module. In the Transformer architecture, multi-head attention is a core component. Unlike the self-attention mechanism, multi-head attention learns h sets of different linear projections to transform queries, keys and values. Subsequently, these h sets of transformed queries, keys, and values are processed in parallel by the attention function. Finally, the outputs from h sets of attention are concatenated and projected again to obtain the ultimate output [13]. The structure of the multi-head attention mechanism is shown in Figure 6.
The entire computation process of the multi-head attention mechanism can be formulated by following Equations (7)–(9):
$F_{\mathrm{multihead}}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^O \quad (7)$
$\mathrm{head}_i = f_{\mathrm{Attention}}(Q W_i^Q, K W_i^K, V W_i^V) \quad (8)$
$f_{\mathrm{Attention}}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^T}{\sqrt{d_K}}\right) V \quad (9)$
where the projections are represented by parameter matrices $W_i^Q \in \mathbb{R}^{d_m \times d_q}$, $W_i^K \in \mathbb{R}^{d_m \times d_k}$, $W_i^V \in \mathbb{R}^{d_m \times d_v}$, and $\mathrm{Concat}$ denotes vector concatenation. The division by $\sqrt{d_K}$ is used to prevent the inputs to the softmax from becoming excessively large. Furthermore, additive attention, often referred to as Bahdanau attention [32], is integrated into the model. This attention mechanism calculates an attention score between the current input and the preceding hidden state. Subsequently, this score is used to perform a weighted sum over the various positions in the input sequence, resulting in a context vector for the current time step. This context vector assists the model in better understanding different parts of the input sequence. By incorporating additive attention, our objective is to enable the model to focus more effectively on essential components within the input sequences of various features. Moreover, additive attention offers greater flexibility, as it learns weight matrices to adapt to diverse tasks and data characteristics, thereby automatically adjusting attention allocation. This attribute enhances the versatility of our model, making it more applicable to diverse prediction tasks across different scenarios.
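For reference, the following NumPy sketch spells out Equations (7)–(9); the weight matrices here are random placeholders rather than learned parameters, and the dimensions are chosen only for illustration:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, Equation (9)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(Q, K, V, WQ, WK, WV, WO):
    """Multi-head attention, Equations (7)-(8): h projected heads,
    concatenated and projected again by WO."""
    heads = [attention(Q @ wq, K @ wk, V @ wv) for wq, wk, wv in zip(WQ, WK, WV)]
    return np.concatenate(heads, axis=-1) @ WO

# Toy example: sequence of 6 feature tokens, model dimension 8, 2 heads of size 4
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
WQ = [rng.normal(size=(8, 4)) for _ in range(2)]
WK = [rng.normal(size=(8, 4)) for _ in range(2)]
WV = [rng.normal(size=(8, 4)) for _ in range(2)]
WO = rng.normal(size=(8, 8))
print(multi_head_attention(X, X, X, WQ, WK, WV, WO).shape)  # (6, 8)
```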

3.2.4. AMTCN Model

For the multivariate time series, our approach involves segregating each data sequence of different variables separately, resulting in multiple univariate time series. Initially, we apply modules such as the dilated convolutional network to perform feature extraction on each univariate time series. Next, we combine the information of these distinct variables and employ multi-head attention to capture the correlations among these different feature variables and the significant information within the sequences. Ultimately, the outputs are concatenated and processed through fully connected layers to produce the final results. Leveraging these multi-head characteristics, we have devised the AMTCN multi-head model, which can effectively handle the mapping relationships between multivariate time series, instead of treating all variables as a unified entity.
The architecture of the AMTCN, built upon the convolutional neural network, is depicted in Figure 3. It involves stacking multiple layers of different residual blocks, with each residual block containing a specific number of convolutional neural networks. This combination of various residual blocks, along with the self-attention mechanism, enables the deep neural network to effectively handle input sequences of varying lengths and different numbers of variables.
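A hedged sketch of how such a multi-head layout could be assembled in Keras is given below; it reuses the residual_block function from the sketch in Section 3.2.2, and the head count, filter sizes, and repeat counts are illustrative assumptions drawn from Section 4.4 rather than the authors' exact code:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_amtcn(window=24, n_features=9, horizon=24, num_heads=2):
    """Minimal sketch of the multi-head AMTCN layout: one convolutional branch per
    input variable, concatenation, multi-head attention, and a dense output layer.
    Reuses residual_block from the earlier sketch; all sizes are illustrative."""
    inputs = tf.keras.Input(shape=(window, n_features))
    branches = []
    for i in range(n_features):                        # one sub-model per feature variable
        xi = layers.Lambda(lambda t, i=i: t[:, :, i:i + 1])(inputs)
        for filters, repeats in [(16, 3), (32, 4), (64, 4)]:  # filters from Section 4.4
            xi = residual_block(xi, filters=filters, repeats=repeats)
        branches.append(xi)
    x = layers.Concatenate(axis=-1)(branches)          # combine per-feature information
    x = layers.MultiHeadAttention(num_heads=num_heads, key_dim=32)(x, x)
    x = layers.Flatten()(x)
    outputs = layers.Dense(horizon)(x)                 # multi-step prediction window
    return Model(inputs, outputs)

model = build_amtcn()
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
```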

4. Experiments

This section commences by presenting four publicly accessible datasets employed in this research. It proceeds to elaborate on the data preprocessing techniques and evaluation metrics applied to assess the performance of the AMTCN model. Subsequent to this, a series of comparative experiments is conducted, pitting the AMTCN model against alternative network architectures across varied prediction window lengths. Lastly, in order to validate the attention mechanism’s efficacy, ablation experiments focusing on the attention module were devised.

4.1. Datasets and Data Processing

Four benchmark datasets were used that are publicly available. Table 1 summarizes the basic information about each dataset.
The Electricity-D Dataset (available online at: https://www.kaggle.com/datasets/fedesoriano/electric-power-consumption (accessed date: 12 December 2023)) represents the electricity consumption situation in Tetouan, a city located in the northern region of Morocco. This dataset consists of 9 feature columns, including the Date, Time, Temperature, Humidity, Wind Speed, General Diffuse Flows, Diffuse Flows and the Power Consumption of Zone 1, Zone 2, and Zone 3. The three zones correspond to Quads, Smir, and Boussafou, respectively. Each feature column contains a data length of 52,416 samples. The data were collected from the Supervisory Control and Data Acquisition System (SCADA) of Amendis https://www.amendis.ma/fr (accessed date: 22 November 2024), a public service operator responsible for the distribution of drinking water and electricity since 2002 [33]. The dataset has a time interval of 10 min, while the other datasets have a data interval of 1 h. In our work, we focus on using Zone 1 Power Consumption as the target for prediction and disregard the data from the other two zones.
The ISO-NE Dataset (available online at: https://www.iso-ne.com/isoexpress/web/reports/load-and-demand (accessed date: 12 November 2023)) was collected and curated by our research team. It contains hourly electricity demand and temperature data from March 2003 to December 2014. Each feature column has a total data length of 103,776 samples, with the hourly electricity demand being our target for prediction.
The Weather Dataset (available online at: https://www.kaggle.com/datasets/vivovinco/hourly-weather-data-in-gallipoli-20082021 (accessed date: 5 December 2023) or https://www.meteoblue.com/en/weather/week/gallipoli_italy_3176366 (accessed date: 5 December 2023)) provides hourly weather data from 2008 to 2021 for Gallipoli, Turkey. This dataset comprises 9 feature columns (excluding the time label), including Temperature, Sunshine Duration, Shortwave Radiation, Relative Humidity, Mean Sea Level Pressure, Soil Temperature, Soil Moisture, Wind Speed, and Wind Direction. Each feature column has a data length of 122,734 samples, with Temperature being the target variable for our prediction.
The Electricity-R Dataset (available online at: https://www.kaggle.com/datasets/stefancomanita/hourly-electricity-consumption-and-production (accessed date: 23 December 2023) or https://www.transelectrica.ro/ro/web/tel/home (accessed date: 23 December 2023)) comprises Hourly Electricity Consumption and Production by Type in Romania for a duration of 4.5 years. This dataset was collected from the Romanian grid operator, Transelectrica, and plays a significant role in analyzing electricity import and export trade. It encompasses data for the period from 1 January 2019 to 12 March 2023, with hourly records for nine features: Consumption, Production, Nuclear, Wind, Hydroelectric, Oil and Gas, Coal, Solar, and Biomass. The total length of each feature column is 36,773. Notably, the Production feature represents the cumulative sum of the subsequent seven features. To facilitate the analysis of electricity consumption, we performed visualization and correlation analysis on this dataset, aiming to gain a clearer understanding and organization of the data. The results of the correlation analysis for the nine features in the Electricity-R Dataset are displayed in Figure 7.
As shown in Figure 7, it is evident that Consumption and Production exhibit the highest correlation from the correlation analysis, whereas Nuclear, Wind, and Solar have relatively low correlations with Consumption. In order to reduce redundant capacity information and enhance the performance of the prediction model, we made the decision to exclude Nuclear, Wind, and Solar features from our research. Consequently, in the experimental phase, the model was built using only six selected features. For a more comprehensive grasp of the data, the information related to the features we need to predict is depicted in Figure 8. This visualization contributes to the holistic dataset analysis. Of note is the meticulous division of the Consumption feature data into distinct training and testing sets, forming the basis for subsequent processing.
For each dataset, we split the data into training, validation, and testing sets with a ratio of 3:1:1. We applied min–max normalization to preprocess the original input data. Following the prediction process, a restoration procedure was conducted to revert the normalized data back to their original scale for subsequent evaluation. The mathematical formulations of the normalization and restoration processes are presented in Formulas (10) and (11)
$X_{\mathrm{scaled}} = \dfrac{X - X_{\min}}{X_{\max} - X_{\min}} \quad (10)$
$X = X_{\mathrm{scaled}} \times (X_{\max} - X_{\min}) + X_{\min} \quad (11)$
where X is our original data, X s c a l e d is the normalized data, and X m a x and X m i n are the maximum and minimum values in the original data. Moreover, any missing values or NA values in the dataset were replaced with zeros.
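The two formulas translate directly into code; the sketch below (illustrative only, with NA values replaced by zeros as described above) applies the scaling and verifies that the restoration step recovers the original data:

```python
import numpy as np

def minmax_scale(X, X_min, X_max):
    """Formula (10): scale data to [0, 1] using the given minimum and maximum."""
    return (X - X_min) / (X_max - X_min)

def minmax_restore(X_scaled, X_min, X_max):
    """Formula (11): map scaled values back to the original range."""
    return X_scaled * (X_max - X_min) + X_min

X = np.nan_to_num(np.array([210.0, np.nan, 305.0, 260.0]))  # replace NA values with zeros
X_min, X_max = X.min(), X.max()
X_scaled = minmax_scale(X, X_min, X_max)
assert np.allclose(minmax_restore(X_scaled, X_min, X_max), X)
```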

4.2. Evaluation Criteria

In this study, we employed four evaluation metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), Relative Root Mean Squared Error (RRSE), and Correlation Coefficient (CORR).
MAE measures the average absolute difference between the predicted values and the actual observations. A lower MAE indicates better predictive performance, as it signifies a smaller average absolute error between the predicted and observed values. MSE assesses the average squared discrepancy between the predicted values and the true values. Similar to MAE, a smaller MSE indicates a more accurate predictive model. RRSE is a normalized error measure obtained by dividing the squared prediction errors by the squared deviations of the actual observations from their mean and taking the square root. It provides a relative error value, which allows for comparison of prediction errors across different datasets. Like MAE and MSE, a smaller RRSE value indicates better model performance. CORR is utilized to measure the linear relationship between the predicted values and the true values. It shows whether the model’s predictive trend aligns with the true values. The CORR value ranges between −1 and 1, where values closer to 1 indicate a positive correlation, values closer to −1 signify a negative correlation, and values closer to 0 suggest no linear relationship. These evaluation metrics are computed by the following Equations (12)–(15)
$\mathrm{MAE} = \dfrac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \quad (12)$
$\mathrm{MSE} = \dfrac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \quad (13)$
$\mathrm{RRSE} = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \quad (14)$
$\mathrm{CORR} = \dfrac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}} \quad (15)$
where $y_i$ represents the $i$th true value, $\bar{y}$ denotes the mean of all true values, $\hat{y}_i$ is the $i$th predicted value, and $\bar{\hat{y}}$ represents the mean of all predicted values. Smaller values of MAE, MSE, and RRSE indicate better model performance, while a higher CORR value indicates improved model accuracy in capturing the underlying trends.
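For completeness, a small NumPy sketch of Equations (12)–(15) is shown below; np.corrcoef is used for CORR, which is equivalent to Equation (15) for a single output variable:

```python
import numpy as np

def evaluate(y_true, y_pred):
    """MAE, MSE, RRSE and CORR as in Equations (12)-(15)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mae = np.mean(np.abs(y_true - y_pred))
    mse = np.mean((y_true - y_pred) ** 2)
    rrse = np.sqrt(np.sum((y_true - y_pred) ** 2) /
                   np.sum((y_true - y_true.mean()) ** 2))
    corr = np.corrcoef(y_true, y_pred)[0, 1]
    return {"MAE": mae, "MSE": mse, "RRSE": rrse, "CORR": corr}

print(evaluate([3.0, 5.0, 7.0, 9.0], [2.8, 5.3, 6.9, 9.4]))
```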

4.3. Training Procedure

Our training process is summarized in Algorithm 1, where min_lr represents the minimum learning rate, init_lr is the initial learning rate, epoch is the maximum number of training epochs, t denotes the number of consecutive epochs in which the best MSE does not improve (after which the learning rate starts to decrease), new_lr represents the new learning rate, and out_put represents the best MSE achieved by the model.
Algorithm 1 Training Procedure
1: epoch = 200; init_lr; factor = 0.8; min_lr = 1 × 10^−4; t = 0
2: for n < epoch do
3:     if MSE < out_put then
4:         out_put = MSE
5:         save model
6:         t = 0
7:     else
8:         t = t + 1
9:         if t >= 10 and init_lr > min_lr then
10:            init_lr = init_lr × factor
11:            new_lr = max(init_lr, min_lr)
12:            t = 0
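Since the model was implemented in Keras (Section 4.4), the scheme in Algorithm 1 corresponds closely to the standard ModelCheckpoint and ReduceLROnPlateau callbacks. The sketch below is an assumed equivalent rather than the authors' training script; model, X_train, y_train, X_val, and y_val are placeholders, while the hyperparameters follow the text (batch size 100, Adam with learning rate 0.001, 200 epochs, factor 0.8, patience 10, minimum learning rate 1e-4):

```python
import tensorflow as tf

callbacks = [
    tf.keras.callbacks.ModelCheckpoint("best_amtcn.h5",              # keep the best model
                                       monitor="val_loss", save_best_only=True),
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.8,
                                         patience=10, min_lr=1e-4),  # decay lr on plateau
]

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss="mse")
history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                    epochs=200, batch_size=100, callbacks=callbacks)
```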
In this study, we employed the walk-forward validation method to address the cross-validation issue, ensuring that the model does not utilize future information while predicting future values. Specifically, walk-forward validation divides the time series data into numerous adjacent time windows. Each time, one window is chosen as the test set, while the remaining windows are used as the training set. The model is trained using data from the training set window, and then predictions and evaluations are conducted on the test set window. This process is repeated until the entire time series is covered.
During the experiments, the output length or prediction window length was configured as [24, 12, 6]. For the multi-step prediction task on different datasets, the evaluation of predictive values was conducted incrementally. Table 2 summarizes the distribution of actual and predicted values during the cross-validation process. The entire process of training and evaluation is outlined as follows:
Step 1. Starting from the beginning of the test set, the last set of observations from the training set, which is the last time window, is used as the input for the model to predict the next set of window data (the first set of actual values in the validation set).
Step 2. The model predicts the next time step.
Step 3. The actual value is obtained and added to the history for the next time step's application.
Step 4. The predicted values are compared with the true values for evaluation.
Step 5. The process returns to Step 1.
This walk-forward validation approach ensures that the model’s predictive performance is evaluated in a robust manner, taking into account the temporal nature of the time series data and preventing information leakage from future time steps.
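A minimal sketch of this walk-forward loop is given below; the array shapes, the target column index, and the use of a fitted Keras model are assumptions for illustration rather than the exact evaluation code:

```python
import numpy as np

def walk_forward_validation(model, history, test_windows, window=24, target_col=0):
    """Walk-forward evaluation following Steps 1-5. `history` is a
    (length, n_features) array ending with the training data; `test_windows`
    is a sequence of (horizon, n_features) arrays of actual observations."""
    predictions, actuals = [], []
    for true_window in test_windows:
        x_input = history[-window:][np.newaxis, ...]   # last observed input window
        yhat = model.predict(x_input, verbose=0)[0]    # predict the next output window
        predictions.append(yhat)
        actuals.append(true_window[:, target_col])     # true target values for scoring
        history = np.vstack([history, true_window])    # append actual values to history
    return np.array(predictions), np.array(actuals)
```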

4.4. Experimental Details

This section offers an in-depth account of the particulars concerning our experiments. The batch size was established at 100, employing the Mean Squared Error (MSE) as the selected loss function. We utilized the Adam optimizer with an initial learning rate of 0.001. Employing these parameter configurations, the predictions and evaluations were performed using the walk-forward validation approach.
For the LSTM model, we configured the number of hidden units in the layers as 50, 100, and 200, and the output layer was set to match the length of the prediction window.
In the ConvLSTM Encoder–Decoder model, the input data were structured as [timestep, row, column, channel]. The timestep was chosen from 1, 3, 7, with the row set at 1 and the column at 24. The channel was selected from 2, 8. Like the LSTM model, this network featured three hidden layers with 64, 128, 200 units. All input-to-state and state-to-state kernels were of size 1 × 3.
The GRU model comprised three hidden layers, each with 64, 128, 256 units, respectively.
In the TCN network, we employed four convolutional layers with kernels of size 1 × 3. We determined the optimal number of hidden units for each layer through a grid search over 30, 50, and 100. The remaining parameters were set as in the original TCN network.
Regarding our AMTCN model, for each dataset, the number of attention heads in the multi-head attention mechanism was set to 2 or 8. The filters in the first residual block were set to 16, and in the two subsequent residual blocks, they were set to 32 and 64, respectively. The AMTCN models were implemented using the keras library with TensorFlow backend. All experiments were conducted on a server equipped with NVIDIA 1080 GPUs (Santa Clara, CA, USA).

4.5. Experimental Results

In this study, we conducted three prediction experiments with different output window lengths while keeping the input window length fixed at 24. Three sets of prediction window lengths, namely 6, 12, and 24, were utilized for conducting comparative experiments involving different datasets and network models.
In the first set of experiments, prediction was carried out using an input window length of 24 and an output window length of 24. The experimental outcomes were then visually presented for all four datasets. Table 3 presents the overall mean values of the MAE, MSE, RRSE and CORR metrics on the test sets for each dataset. From the table, it is evident that the AMTCN model outperformed the other four baseline models across all four metrics. Specifically, when using MSE as the accuracy comparison metric, the AMTCN model exhibited a 57% higher prediction accuracy than the ConvLSTM model in the Electricity-R dataset, and a 45% higher accuracy than the best-performing LSTM model. In the Electricity-D dataset, the AMTCN model showed a 35% higher prediction accuracy than the ConvLSTM model, and an 8% higher accuracy than the best-performing GRU model. In the ISO-NE dataset, the AMTCN model displayed a 14% higher prediction accuracy than the TCN model, and a 10% higher accuracy than the best-performing GRU model. In the Weather dataset, the AMTCN model demonstrated a 39% higher prediction accuracy than the ConvLSTM model, and a 23% higher accuracy than the best-performing TCN model. Furthermore, we randomly selected a continuous 24 h segment (one prediction window) of MSE data from each dataset and visualized the results. For the Electricity-D dataset, the selected segment consisted of 24 intervals of 10 min. As shown in Figure 9, the AMTCN model consistently outperformed the other baseline models, with most of the lowest MSE values at 24 h. This performance consistency across different datasets indicates the AMTCN model’s strong generalization and robustness. Subsequently, in the scenario where the input window size is 24 and the prediction window size is 24, we conducted a comparative analysis of the AMTCN model’s continuous 7 day predictions against the ground truth, as illustrated in Figure 10. Finally, for the electricity consumption forecasting task, we specifically compared the continuous 7 day predictions of all models on the Electricity-R dataset, as shown in Figure 11. Notably, the results clearly indicate that the AMTCN model outperforms other networks significantly.
For the second set of experiments, a prediction window length of 12 was employed, while the third set of experiments employed a prediction window length of 6. An analysis of the aggregate mean values for the four metrics on the test sets was carried out for both electricity consumption datasets. To enable a comparative assessment of predictive capacities across diverse prediction window lengths, an amalgamation of the outcomes from the three experiment sets was executed. This consolidation provided a visual depiction of predictive performance variations under distinct prediction lengths in Table 4. It is observed that when the prediction window length is 12, using MSE as the accuracy metric, the AMTCN model outperforms the baseline models. In the Electricity-R dataset, the AMTCN model’s prediction accuracy is 50% higher than the TCN model and 20% higher than the best-performing LSTM model from the baseline models. In the Electricity-D dataset, the AMTCN model’s prediction accuracy is 27% higher than the LSTM model and 15% higher than the best-performing GRU model among the baseline models. Additionally, it is evident that when the prediction window is set to 6, the performance of the metrics is the best, followed by 12, and the worst is 24. This leads to the conclusion that, with a fixed input window, shorter prediction windows lead to higher predictive accuracy. Similar to the first set of experiments, we randomly selected a continuous 12-hour segment of MSE data for visualization, as shown in Figure 12. It can be observed that the AMTCN model outperforms other baseline models under the prediction window of 12. Finally, concerning the electricity consumption prediction task, the distribution of predictive outcomes and actual values for each network in the context of the Electricity-R dataset was graphically presented. This was achieved within a continuous 7 day prediction window of 12, as shown in Figure 13. From the graph, it is evident that under the prediction window of 12, the AMTCN model’s predictive performance is superior to other networks.
In the third set of experiments, our analysis methodology was the same as in the second set. The results, as presented in Table 4, show that with a prediction window length of 6, the AMTCN model outperforms the TCN model in terms of prediction accuracy by 56% on the Electricity-R dataset, and it is 13% higher than the best-performing GRU model. Similarly, on the Electricity-D dataset, the AMTCN model’s prediction accuracy is 36% higher than the TCN model and 19% higher than the best-performing GRU model. The MSE distribution of the continuous six predicted observations by the AMTCN model is also superior to that of other baseline models, as shown in Figure 14. Furthermore, Figure 15 displays the predictive results of the AMTCN model and various other networks under a prediction window of six, confirming the superior predictive performance of the AMTCN model.
In conclusion, the predictive outcomes of each network on the Electricity-R dataset are visually represented across three distinct prediction window lengths, as depicted in Figure 16. By amalgamating the four assessment metrics for each prediction window, it becomes evident that the reduced prediction windows yield heightened prediction accuracy at the expense of extended model training time.
Across the three experiment sets, the displayed results underscore AMTCN model’s enhanced predictive capacity, broader generalization, and increased robustness in contrast to alternative baseline models across various scenarios and prediction demands.

4.6. Ablation Tests

To validate the effectiveness of the attention mechanism module in our model, we conducted an ablation experiment. Specifically, we removed all attention mechanism modules from the AMTCN model, retaining only the three residual blocks; the resulting ablation model, named MTCN, is illustrated in Figure 17. Additionally, the performance of the AMTCN model and the ablation model was compared on each dataset using the designated evaluation metrics.
Table 5 summarizes the overall average values of MAE, MSE, RRSE, and CORR metrics for the ablation model under the condition of an input window of 24 and a prediction (output) window of 24. It can be observed that based on MSE as the accuracy criterion, the attention mechanism module has significantly contributed to the model’s predictive performance on each dataset. Specifically, in the Electricity-R dataset, the attention mechanism module improved the prediction accuracy by 12%. In the Electricity-D dataset, the module resulted in an 18% improvement in prediction accuracy. For the ISO-NE dataset, the attention mechanism module led to a 10% increase in prediction accuracy. Moreover, in the Weather dataset, the attention mechanism module substantially enhanced the prediction accuracy by 33%. This investigation exemplifies the pivotal significance of the attention mechanism module in enhancing the holistic predictive performance of the AMTCN model across diverse datasets.
In the same manner, under the condition of an input window of 24 and a prediction (output) window of 24, we visualized the comparison of MSE data for 24 consecutive predictions made by the two networks on the four datasets, as shown in Figure 18. It is evident from the figure that the AMTCN network with the added attention mechanism module exhibits significantly superior predictive capabilities compared to the MTCN network without the attention mechanism and the prediction effect is smoother for continuously changing data.
Table 6 summarizes the total number of parameters for the ablation model MTCN and the full AMTCN model. One could discern that our attention module not only augments the predictive capabilities of the model but also diminishes the overall parameter count, consequently economizing computational expenses. This analysis further confirms the effectiveness and efficiency of our attention mechanism module in improving the predictive performance of the AMTCN model.

5. Conclusions

In this paper, we proposed a novel model called AMTCN for predicting multivariate time series across four different datasets. Specifically, we addressed the electricity consumption forecasting problem and conducted a detailed analysis and preprocessing on the electricity consumption dataset. The AMTCN model incorporates additive attention, multi-head attention from traditional Transformers, dilated convolutions, residual connections, and other network structures. This design choice effectively overcomes the limitations of conventional CNN and RNN networks in capturing long-term dependencies and their inability to parallelize computations. Moreover, the multi-head attention mechanism strengthens the model’s ability to extract important contextual information and correlations among different feature sequences, leading to improved predictive accuracy. For the assessment of the AMTCN model’s performance, a series of three experiments was conducted, incorporating distinct prediction window lengths across various datasets. The comparison involves the utilization of four evaluation metrics: MAE, MSE, RRSE and CORR. This comparison was performed against four baseline models: LSTM, GRU, ConvLSTM and TCN. The experimental results indicate that the AMTCN model outperforms the baseline model. The maximum improvements observed are 57% for the MSE prediction metric, 37% for MAE, 35% for RRSE, and 12% for CORR. This finding substantiates the model’s superior competitiveness, robustness, and generalization capabilities, which may exemplify its potential as a viable solution for real-world time series forecasting endeavors.

Author Contributions

Conceptualization, R.W. and W.Z.; formal analysis, F.Y. and J.L.; funding acquisition, W.Z.; investigation, F.Y.; project administration, R.W. and W.Z.; resources, W.D. and S.T.; software, M.L.; supervision, R.W. and W.Z.; visualization, Y.H.; writing of the manuscript, R.W., W.Z. and Y.H.; reviewing and editing, Y.H., R.W. and W.D. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the Research Foundation of the Education Bureau of Hubei Province, China (No. Q20231713).

Data Availability Statement

The Electricity-D Dataset (available online at: https://www.kaggle.com/datasets/fedesoriano/electric-power-consumption (accessed date: 12 December 2023)). The Supervisory Control and Data Acquisition System (SCADA) of Amendis dataset (available online at: https://www.amendis.ma/fr (accessed date: 22 January 2024)) Amendis (https://www.amendis.ma/fr (accessed date: 22 January 2024)). The ISO-NE Dataset (available online at: https://www.iso-ne.com/isoexpress/web/reports/load-and-demand (accessed date: 12 November 2023)). The Weather Dataset (available online at: https://www.kaggle.com/datasets/vivovinco/hourly-weather-data-in-gallipoli-20082021 (accessed date: 5 December 2023) or https://www.meteoblue.com/en/weather/week/gallipoli_italy_3176366 (accessed date: 5 December 2023)). The Electricity-R Dataset (available online at: https://www.kaggle.com/datasets/stefancomanita/hourly-electricity-consumption-and-production (accessed date: 23 December 2023) or https://www.transelectrica.ro/ro/web/tel/home (accessed date: 23 December 2023)).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Panwar, N.L.; Kaushik, S.C.; Kothari, S. Role of renewable energy sources in environmental protection: A review. Renew. Sustain. Energy Rev. 2011, 15, 1513–1524. [Google Scholar] [CrossRef]
  2. Hussain, A.; Rahman, M.; Memon, J.A. Forecasting electricity consumption in Pakistan: The way forward. Energy Policy 2016, 90, 73–80. [Google Scholar] [CrossRef]
  3. Kim, Y.; Son, H.g.; Kim, S. Short term electricity load forecasting for institutional buildings. Energy Rep. 2019, 5, 1270–1280. [Google Scholar] [CrossRef]
  4. Smyl, S. A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting. Int. J. Forecast. 2020, 36, 75–85. [Google Scholar] [CrossRef]
  5. Fan, G.F.; Yu, M.; Dong, S.Q.; Yeh, Y.H.; Hong, W.C. Forecasting short-term electricity load using hybrid support vector regression with grey catastrophe and random forest modeling. Util. Policy 2021, 73, 101294. [Google Scholar] [CrossRef]
  6. Abbasimehr, H.; Paki, R.; Bahrini, A. A novel XGBoost-based featurization approach to forecast renewable energy consumption with deep learning models. Sustain. Comput. Inform. Syst. 2023, 38, 100863. [Google Scholar] [CrossRef]
  7. Zulfiqar, M.; Kamran, M.; Rasheed, M.; Alquthami, T.; Milyani, A. Hyperparameter optimization of support vector machine using adaptive differential evolution for electricity load forecasting. Energy Rep. 2022, 8, 13333–13352. [Google Scholar] [CrossRef]
  8. Fu, Y.; Li, Z.; Zhang, H.; Xu, P. Using Support Vector Machine to Predict Next Day Electricity Load of Public Buildings with Sub-metering Devices. Procedia Eng. 2015, 121, 1016–1022. [Google Scholar] [CrossRef]
  9. Atef, S.; Eltawil, A.B. Assessment of stacked unidirectional and bidirectional long short-term memory networks for electricity load forecasting. Electr. Power Syst. Res. 2020, 187, 106489. [Google Scholar] [CrossRef]
  10. Li, G.; Zhong, X. Parking demand forecasting based on improved complete ensemble empirical mode decomposition and GRU model. Eng. Appl. Artif. Intell. 2023, 119, 105717. [Google Scholar] [CrossRef]
  11. Niu, Z.; Yu, Z.; Tang, W.; Wu, Q.; Reformat, M. Wind power forecasting using attention-based gated recurrent unit network. Energy 2020, 196, 117081. [Google Scholar] [CrossRef]
  12. Limouni, T.; Yaagoubi, R.; Bouziane, K.; Guissi, K.; Baali, E.H. Accurate one step and multistep forecasting of very short-term PV power using LSTM-TCN model. Renew. Energy 2023, 205, 1010–1024. [Google Scholar] [CrossRef]
  13. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 4–9 December 2017; NIPS’17. pp. 6000–6010. [Google Scholar]
  14. Li, X.; Zhong, Y.; Shang, W.; Zhang, X.; Shan, B.; Wang, X. Total electricity consumption forecasting based on Transformer time series models. Procedia Comput. Sci. 2022, 214, 312–320. [Google Scholar] [CrossRef]
  15. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. arXiv 2020, arXiv:2012.07436. [Google Scholar] [CrossRef]
  16. Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. In Proceedings of the 35th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 6–14 December 2021. NIPS ’21. [Google Scholar]
  17. Ødegaard Bentsen, L.; Warakagoda, N.D.; Stenbro, R.; Engelstad, P. Spatio-temporal wind speed forecasting using graph networks and novel Transformer architectures. Appl. Energy 2023, 333, 120565. [Google Scholar] [CrossRef]
  18. Han, Z.; Zhao, J.; Leung, H.; Ma, K.F.; Wang, W. A Review of Deep Learning Models for Time Series Prediction. IEEE Sensors J. 2021, 21, 7833–7848. [Google Scholar] [CrossRef]
  19. Chen, Z.; Ma, M.; Li, T.; Wang, H.; Li, C. Long sequence time-series forecasting with deep learning: A survey. Inf. Fusion 2023, 97, 101819. [Google Scholar] [CrossRef]
  20. Bengio, Y.; Simard, P.Y.; Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 1994, 5, 157–166. [Google Scholar] [CrossRef]
  21. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  22. Amalou, I.; Mouhni, N.; Abdali, A. Multivariate time series prediction by RNN architectures for energy consumption forecasting. Energy Rep. 2022, 8, 1084–1091. [Google Scholar] [CrossRef]
  23. Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.k.; Woo, W.c. Convolutional LSTM Network: A machine learning approach for precipitation nowcasting. In Proceedings of the 28th International Conference on Neural Information Processing Systems—Volume 1, Cambridge, MA, USA, 7–12 December 2015; NIPS’15. pp. 802–810. [Google Scholar]
  24. Bai, S.; Kolter, J.Z.; Koltun, V. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv 2018, arXiv:1803.01271. [Google Scholar] [CrossRef]
  25. Yu, F.; Koltun, V. Multi-Scale Context Aggregation by Dilated Convolutions. arXiv 2015, arXiv:1511.07122. [Google Scholar]
  26. Wan, R.; Mei, S.; Wang, J.; Liu, M.; Yang, F. Multivariate Temporal Convolutional Network: A Deep Neural Networks Approach for Multivariate Time Series Forecasting. Electronics 2019, 8, 876. [Google Scholar] [CrossRef]
  27. Bak, G.; Bae, Y. Deep learning algorithm development for river flow prediction: PNP algorithm. Soft Comput. 2023, 27, 13487–13515. [Google Scholar] [CrossRef]
  28. Noda, K.; Yamaguchi, Y.; Nakadai, K.; Okuno, H.G.; Ogata, T. Audio-visual speech recognition using deep learning. Appl. Intell. 2015, 42, 722–737. [Google Scholar] [CrossRef]
  29. van den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.W.; Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. arXiv 2016, arXiv:1609.03499. [Google Scholar] [CrossRef]
  30. Gupta, A.; Pawade, P.; Balakrishnan, R. Deep Residual Network and Transfer Learning-based Person Re-Identification. Intell. Syst. Appl. 2022, 16, 200137. [Google Scholar] [CrossRef]
  31. Glegoła, W.; Karpus, A.; Przybyłek, A. MobileNet family tailored for Raspberry Pi. Procedia Comput. Sci. 2021, 192, 2249–2258. [Google Scholar] [CrossRef]
  32. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  33. Salam, A.; Hibaoui, A.E. Comparison of Machine Learning Algorithms for the Power Consumption Prediction: -Case Study of Tetouan city-. In Proceedings of the 2018 6th International Renewable and Sustainable Energy Conference (IRSEC), Rabat, Morocco, 5–8 December 2018; pp. 1–5. [Google Scholar] [CrossRef]
Figure 1. A dilated causal convolution with dilation factors d = 1, 2, 4 and filter size k = 3.
Figure 1. A dilated causal convolution with dilation factors d = 1, 2, 4 and filter size k = 3.
Electronics 13 04080 g001
Figure 2. TCN residual block. A 1 × 1 convolution is added when residual input and output have different dimensions.
Figure 2. TCN residual block. A 1 × 1 convolution is added when residual input and output have different dimensions.
Electronics 13 04080 g002
Figure 3. Overall architecture of the AMTCN model.
Figure 3. Overall architecture of the AMTCN model.
Electronics 13 04080 g003
Figure 4. Visualization of dilated convolution with different dilation factors.
Figure 4. Visualization of dilated convolution with different dilation factors.
Electronics 13 04080 g004
Figure 5. An overview of two residual blocks with asymmetric structure: Residual Block 1 with three layers of dilated convolution (left) and Residual Block 2 with four layers of dilated convolution (right).
Figure 6. Multi-head attention consists of several attention layers running in parallel.
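As a rough illustration of the multi-head attention in Figure 6, the snippet below applies torch.nn.MultiheadAttention as self-attention across per-variable feature vectors. The embedding size, head count, and the assumption of nine feature variables are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

d_model, n_heads = 64, 4                                    # placeholder sizes
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)

# One feature vector per input variable (9 variables assumed here, as in Electricity-R).
features = torch.randn(8, 9, d_model)                       # (batch, variables, embedding)
context, weights = attn(features, features, features)       # self-attention across variables
print(context.shape, weights.shape)                         # [8, 9, 64] and [8, 9, 9]
```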
Figure 7. The Pearson correlation coefficient between different power generation methods and consumption.
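The correlation analysis behind Figure 7 can be reproduced along the following lines, assuming the raw data is loaded into a pandas DataFrame with one column per generation method plus a consumption column; the file and column names below are hypothetical.

```python
import pandas as pd

# Hypothetical file and column names; "consumption" stands for the target variable.
df = pd.read_csv("electricity_r.csv")
corr = df.corr(method="pearson", numeric_only=True)["consumption"].drop("consumption")
print(corr.sort_values(ascending=False))   # Pearson correlation of each feature with consumption
```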
Figure 8. Distribution of data used in the Electricity-R dataset for the consumption feature.
Figure 9. The MSE of the predicted values of each network on the four datasets with a prediction window of 24.
Figure 10. Predicted and actual values of the AMTCN model for seven consecutive days on four datasets.
Figure 11. Prediction results for each network on the Electricity-R dataset for 7 consecutive days.
Figure 12. MSE of the predicted values of each network on the two power consumption datasets for a prediction window of 12.
Figure 13. Prediction results for each network on the Electricity-R dataset for 7 consecutive days with a prediction window of 12.
Figure 14. MSE of the predicted values of each network on the two power consumption datasets for a prediction window of 6.
Figure 15. Prediction results for each network on the Electricity-R dataset for 7 consecutive days with a prediction window of 6.
Figure 16. Seven consecutive days of AMTCN model predictions for Electricity-R under three prediction windows.
Figure 17. Overall structure of the MTCN ablation model.
Figure 18. MSE of predicted values of MTCN and AMTCN on each dataset with a prediction window of 24.
Table 1. Basic information on the four datasets.

Datasets        Length of Each Variable    Number of Variables    Sample Rate
Electricity-D   52,416                     6                      10 min
Electricity-R   36,773                     9                      1 h
ISO-NE          103,776                    2                      1 h
Weather         122,734                    9                      1 h
Table 2. Dataset statistics, where h denotes hour and d denotes day.

Input (Actual Value):                Output (Predicted Value):
Current Time Window                  Next Time Window
1 d 1 h – 1 d 24 h                   2 d 1 h – 2 d 24 h
2 d 1 h – 2 d 24 h                   3 d 1 h – 3 d 24 h
1 d 1 h – 1 d 12 h                   1 d 13 h – 1 d 24 h
1 d 13 h – 1 d 24 h                  2 d 1 h – 2 d 12 h
1 d 1 h – 1 d 6 h                    1 d 7 h – 1 d 12 h
1 d 7 h – 1 d 12 h                   1 d 13 h – 1 d 18 h
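Table 2's pairing of a current window with the next window of equal length (24, 12, or 6 h) corresponds to a simple non-overlapping slicing of the hourly series, sketched below; the stride and preprocessing actually used by the authors may differ.

```python
import numpy as np


def make_windows(series, window):
    """Pair each window-length segment with the segment that immediately follows it."""
    inputs, targets = [], []
    for start in range(0, len(series) - 2 * window + 1, window):
        inputs.append(series[start:start + window])
        targets.append(series[start + window:start + 2 * window])
    return np.stack(inputs), np.stack(targets)


hourly = np.arange(96, dtype=float)            # four days of hourly values (toy data)
x, y = make_windows(hourly, window=24)
print(x.shape, y.shape)                        # (3, 24) (3, 24): day 1 -> day 2, day 2 -> day 3, ...
```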
Table 3. Overall mean of MAE, MSE, RRSE, and CORR for the prediction results of each network on each dataset (prediction window length = 24).

Methods    Metrics   Electricity-R   Electricity-D   ISO-NE    Weather
LSTM       MSE       0.370           0.103           0.065     0.078
           MAE       0.411           0.226           0.176     0.204
           RRSE      0.4851          0.3213          0.2547    0.2800
           CORR      0.8853          0.9506          0.9677    0.9603
ConvLSTM   MSE       0.469           0.124           0.060     0.092
           MAE       0.471           0.257           0.164     0.228
           RRSE      0.6850          0.3527          0.2445    0.3037
           CORR      0.8174          0.9406          0.9698    0.9528
GRU        MSE       0.395           0.087           0.058     0.079
           MAE       0.473           0.213           0.163     0.204
           RRSE      0.6290          0.2949          0.2405    0.2820
           CORR      0.8394          0.9587          0.9708    0.9599
TCN        MSE       0.411           0.119           0.065     0.073
           MAE       0.430           0.244           0.178     0.199
           RRSE      0.6416          0.3459          0.2554    0.2707
           CORR      0.8046          0.9411          0.9669    0.9628
AMTCN      MSE       0.201           0.080           0.052     0.056
           MAE       0.304           0.205           0.155     0.167
           RRSE      0.4483          0.2824          0.2277    0.2364
           CORR      0.9042          0.9596          0.9740    0.9717
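The four metrics reported in Tables 3–5 can be computed with their usual definitions, sketched below; the paper's exact normalization (for example, whether CORR is averaged per variable) is not restated here, so treat this as an approximation.

```python
import numpy as np


def evaluate(y_true, y_pred):
    """MSE, MAE, RRSE, and CORR with their common definitions."""
    err = y_pred - y_true
    mse = np.mean(err ** 2)
    mae = np.mean(np.abs(err))
    rrse = np.sqrt(np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2))
    corr = np.corrcoef(y_true.ravel(), y_pred.ravel())[0, 1]
    return {"MSE": mse, "MAE": mae, "RRSE": rrse, "CORR": corr}


# Example: perfect predictions give MSE = MAE = RRSE = 0 and CORR = 1.
truth = np.sin(np.linspace(0, 6.28, 168))      # one week of hourly values (toy data)
print(evaluate(truth, truth))
```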
Table 4. Overall mean values of MAE, MSE, RRSE, and CORR for each network on the two electricity consumption datasets for the three prediction window lengths (24/12/6).

Methods    Metrics   Electricity-R (24/12/6)       Electricity-D (24/12/6)
LSTM       MSE       0.370 / 0.116 / 0.084         0.103 / 0.091 / 0.037
           MAE       0.411 / 0.234 / 0.189         0.226 / 0.204 / 0.124
           RRSE      0.4851 / 0.3407 / 0.2902      0.3213 / 0.3031 / 0.1932
           CORR      0.8853 / 0.9432 / 0.9591      0.9506 / 0.9608 / 0.9827
ConvLSTM   MSE       0.469 / 0.152 / 0.111         0.124 / 0.081 / 0.037
           MAE       0.411 / 0.273 / 0.238         0.257 / 0.209 / 0.134
           RRSE      0.6850 / 0.3896 / 0.3329      0.3527 / 0.2855 / 0.1934
           CORR      0.8174 / 0.9252 / 0.9490      0.9406 / 0.9617 / 0.9816
GRU        MSE       0.395 / 0.123 / 0.063         0.087 / 0.078 / 0.031
           MAE       0.473 / 0.248 / 0.166         0.213 / 0.200 / 0.116
           RRSE      0.4851 / 0.3407 / 0.2902      0.2949 / 0.2658 / 0.1758
           CORR      0.8853 / 0.9432 / 0.9591      0.9587 / 0.9679 / 0.9851
TCN        MSE       0.411 / 0.186 / 0.124         0.119 / 0.085 / 0.039
           MAE       0.430 / 0.299 / 0.249         0.244 / 0.211 / 0.140
           RRSE      0.6416 / 0.4317 / 0.3518      0.3459 / 0.2919 / 0.1965
           CORR      0.8046 / 0.9144 / 0.9424      0.9411 / 0.9611 / 0.9823
AMTCN      MSE       0.201 / 0.093 / 0.055         0.080 / 0.066 / 0.025
           MAE       0.304 / 0.204 / 0.156         0.205 / 0.181 / 0.109
           RRSE      0.4483 / 0.3030 / 0.2346      0.2824 / 0.2630 / 0.1585
           CORR      0.9042 / 0.9581 / 0.9731      0.9596 / 0.9667 / 0.9886
Table 5. Mean values of each metric for the MTCN ablation model and the AMTCN model on each dataset for a prediction window of 24.

Methods   Metrics   Electricity-R   Electricity-D   ISO-NE    Weather
MTCN      MSE       0.229           0.098           0.058     0.083
          MAE       0.326           0.227           0.162     0.214
          RRSE      0.4787          0.3132          0.2415    0.2881
          CORR      0.8908          0.9509          0.9705    0.9577
AMTCN     MSE       0.201           0.080           0.052     0.056
          MAE       0.304           0.205           0.155     0.167
          RRSE      0.4483          0.2824          0.2277    0.2364
          CORR      0.9042          0.9596          0.9740    0.9717
Table 6. Number of trainable parameters of the MTCN ablation model and the AMTCN model on each dataset.

Methods   Electricity-R   Electricity-D   ISO-NE      Weather
MTCN      3,815,684       2,551,364       4,552,004   3,815,684
AMTCN     1,277,764       1,450,052       275,896     2,606,084
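The magnitudes in Table 6 are consistent with counts of trainable parameters; in PyTorch (assumed here, not stated in the paper), such a count can be obtained as follows.

```python
import torch.nn as nn


def count_parameters(model: nn.Module) -> int:
    """Total number of trainable parameters in a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


print(count_parameters(nn.Linear(10, 1)))      # 11 = 10 weights + 1 bias
```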