Crude distillation units (CDUs) play a central role in the refining sector: they fractionate crude oil into intermediate products that are subsequently processed in downstream units to meet market specifications. The quality of these intermediate products, which is primarily influenced by the operating conditions of the CDUs, is crucial for ensuring the quality of the final refinery products [1]. Currently, the oil refining industry is confronting several challenges, including a significant increase in crude oil prices, fluctuations in product demand driven by market dynamics, and specific regulatory constraints imposed on industrial activities [2]. Improving the operation of these units may yield significant gains in the efficiency and reliability of oil refining processes, ultimately reducing operational costs and increasing overall savings. Developing suitable process control for the CDU is necessary to achieve these objectives. Nevertheless, the existing literature focuses primarily on the design of control algorithms rather than on identifying a CDU model of sufficient accuracy.
In recent years, significant advancements in artificial intelligence have facilitated the development of dynamic models through various data-driven techniques, such as polynomial regression [3, 4, 5, 6], support vector regression (SVR) [7], and artificial neural networks (ANNs). For example, Liau et al. [8] used an ANN to predict the yields of kerosene, diesel, and atmospheric gas oil with the aim of optimizing product outputs. Motlaghi et al. [9] employed an ANN to predict product flow rates, which were then optimized based on market values. Gueddar et al. [10] developed an ANN model to optimize energy efficiency by considering crude oil properties, such as boiling point and crude flow rate. Building on this approach, Durrani et al. [11] developed a multi-output ANN model to address variations in crude composition and predict optimum cut-point temperatures, using a hybrid Taguchi and genetic algorithm for more energy-efficient operation. Ochoa-Estopier et al. [12] developed an ANN model for a CDU and employed a simulated annealing (SA) optimizer to enhance revenue while reducing energy usage. This work was further extended by the same authors in [13, 14], who incorporated a heat exchanger network model to enhance operational optimization, aiming to boost net profit while adhering to practical constraints. Shi et al. [15] modeled a CDU process using a wavelet neural network, which was combined with the line-up competition algorithm (LCA) for the economic optimization of the CDU operation. More recently, a Long Short-Term Memory (LSTM) network was developed in [16] to predict and analyze energy efficiency in the CDU under different operating conditions. A hybrid ANN-SVM model was developed in [17] to simulate the performance of the CDU accurately and efficiently within an optimization framework. Li et al. [18] developed a hybrid Fuzzy Logic–ANN model to construct a knowledge-based strategy that adapts to different feedstock properties. Osuolale and Zhang [19] and Muhsin et al. [20] used bootstrap ANN models to model a CDU process, with the former focusing on energy efficiency and the latter on maximizing the production rate. Different data-driven models for predicting CDU product properties were compared in [21], including PCA-ResNet, SOM-ResNet, Feedforward Neural Networks (FNNs), Partial Least Squares (PLS), and LASSO. The study concluded that incorporating prior knowledge and employing appropriate dimensionality reduction, as in PCA-ResNet, greatly improved model accuracy. Other machine learning methods can offer more accurate and sometimes more computationally efficient solutions than ANNs, which can be complex and resource-intensive [22]. Fadzil et al. [23] explored five machine learning models for optimizing product yields based on varying feed properties and operating conditions. These models were decision tree regression, support vector regression, ANN, random forest regression, and extreme gradient boosting (XGBoost), with XGBoost demonstrating superior performance.
While ANNs have proven suitable for modeling a CDU, they rely on a static, data-driven mapping from inputs to outputs, which limits insight into the physical or chemical mechanisms governing the CDU. Linear predictors can instead be applied to nonlinear systems: they capture nonlinear behavior while yielding a model whose dynamic response can be analyzed as that of a linear system. From a control engineering perspective, this approach provides valuable insights into the process through its time-domain characteristics [24]. In this context, Bernard Koopman [25] introduced the operator that now bears his name, which describes the evolution over time of measurements in Hamiltonian systems. Mezić and his collaborators [26, 27] expanded these concepts to include nonconservative systems, extending their utility to a wide range of applications, including solar panels [28], power systems [29], robotics [30], autonomous driving [31], biology [32], and traffic flow [33], to name a few.
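To make the idea of a linear predictor for a nonlinear system concrete, the following is a minimal sketch on a toy two-state system, not on the CDU and not the identification procedure used in this paper. The system, its parameters (a, b, c), and the dictionary of observables [x1, x2, x1^2] are all assumptions chosen for illustration because this particular lifting makes the dynamics exactly linear; for a real process, the choice of observables is part of the modeling problem. The linear operator is fitted by ordinary least squares on lifted snapshot pairs.

```python
import numpy as np

# Toy nonlinear system (assumed for illustration only):
#   x1+ = a * x1
#   x2+ = b * x2 + c * x1**2
A_PARAM, B_PARAM, C_PARAM = 0.9, 0.5, 0.3


def step(x):
    """Advance a batch of states (n_samples, 2) by one time step."""
    x = np.atleast_2d(x)
    x1_next = A_PARAM * x[:, 0]
    x2_next = B_PARAM * x[:, 1] + C_PARAM * x[:, 0] ** 2
    return np.column_stack([x1_next, x2_next])


def lift(x):
    """Hypothetical dictionary of observables: z = [x1, x2, x1^2]."""
    x = np.atleast_2d(x)
    return np.column_stack([x[:, 0], x[:, 1], x[:, 0] ** 2])


# Snapshot pairs (x_k, x_{k+1}) sampled at random initial states.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
Y = step(X)

# Fit the lifted linear model z_{k+1} = K z_k by least squares.
# lstsq solves Z @ K.T ~= Z_next, so the result is transposed.
Z, Z_next = lift(X), lift(Y)
K_T, *_ = np.linalg.lstsq(Z, Z_next, rcond=None)
K = K_T.T

# Multi-step prediction: iterate the linear model in the lifted space
# and read the original states back from the first two coordinates.
x_true = np.array([[0.8, -0.4]])
z = lift(x_true)[0]
for _ in range(10):
    z = K @ z
    x_true = step(x_true)
x_pred = z[:2]

err = np.max(np.abs(x_pred - x_true[0]))
print("Estimated K:\n", np.round(K, 4))
print("Max 10-step prediction error:", err)
```

In this toy case the lifted dynamics are exactly linear, so the prediction error is at the level of numerical round-off. For a general nonlinear process no finite dictionary is guaranteed to be exact, and the fitted operator is only an approximation whose quality depends on the observables and the data.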
The quality of a model is usually evaluated by its ability to generalize to new, unseen data. In regression tasks, if the model is properly selected in terms of structure and hyperparameters, and overfitting is avoided, the problem can be considered conceptually solved. However, when dealing with real data from advanced process control units, inputs to the model may fall outside the training domain due to factors such as model-plant mismatch, poor tuning, or stringent constraints in the closed-loop system. In such situations, it is essential for the regression model to make acceptable predictions or, at the very least, not to fail. This issue is the subject of this paper. In this work, a comparative analysis of CDU models under different experimental conditions is conducted. The models compared are the Koopman operator in both linear (KL) and bilinear (KB) forms and a NARX-NN model. The performance of these models is tested under realistic conditions, such as gain and delay mismatches, nonlinearities, and disturbances. Bayesian optimization is used for hyperparameter tuning to ensure a fair comparison. The remainder of the paper is structured as follows: Section 2 provides preliminaries of the methodologies used. Section 3 covers the process description and data generation. The results and discussion are presented in Section 4. Finally, the conclusion and suggestions for future work are outlined in Section 5.