Data Drift Detection
Authors: India Lindsay, Thomas Schill, Leigh Nicholl
What Is Data Drift?
Data drift refers to significant changes in the distribution of features within a dataset. This could include shifts in the distribution of a single feature or shifts in relationships between variables, such as changes in correlations between features. [1]
What Do Data Drift Detectors Accomplish?
Data drift detection algorithms do not require labels and may be deployed in unsupervised settings. Labels may not be readily available if obtaining them is costly or if, due to the nature of the use case, a significant time lag exists between when the models are applied and when the results can be verified [1].
Data drift detectors can identify distributional shift and the variables driving those changes. They enable the data-driven confirmation of research questions about whether a population has changed over time or following some known event.
General Overview of Data Drift Methodology
Data drift detectors typically use the hypothesis testing framework to identify drift. We want to know whether one batch of data, the “reference data”, is different from another batch, the “test data.” Both sets are summarized. The summary could be as simple as a histogram for each feature, or more complicated, like Principal Component Analysis, a common dimension reduction technique.
Then, the algorithm defines a test statistic based on that summary to quantify the difference between the reference and test data. The null hypothesis is that the test and reference data share a common probability distribution, while the alternative hypothesis is that they do not. If the value of the test statistic is very unlikely under the null hypothesis, the null is rejected and the drift detector raises an alarm [1].
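The framework above can be sketched for a single feature using only the Python standard library. This is an illustrative example, not one of the Menelaus detectors: the summary is a histogram, the test statistic is the total variation distance between the two histograms, and the null distribution is approximated by shuffling the pooled data. The function names, the bin count, and the significance level alpha are assumptions made for the sketch.

```python
import random

def histogram(values, lo, hi, bins=10):
    """Summarize a sample as bin proportions over the fixed range [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp the right edge
        counts[idx] += 1
    return [c / len(values) for c in counts]

def tv_statistic(reference, test, bins=10):
    """Test statistic: total variation distance between the two histograms."""
    lo = min(min(reference), min(test))
    hi = max(max(reference), max(test))
    p = histogram(reference, lo, hi, bins)
    q = histogram(test, lo, hi, bins)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def permutation_p_value(reference, test, n_permutations=500, seed=0):
    """Estimate how unusual the observed statistic is under the null."""
    rng = random.Random(seed)
    observed = tv_statistic(reference, test)
    pooled = list(reference) + list(test)
    exceed = 0
    for _ in range(n_permutations):
        # Under the null, batch membership is arbitrary, so reshuffle it.
        rng.shuffle(pooled)
        stat = tv_statistic(pooled[:len(reference)], pooled[len(reference):])
        if stat >= observed:
            exceed += 1
    return (exceed + 1) / (n_permutations + 1)

rng = random.Random(42)
reference = [rng.gauss(0.0, 1.0) for _ in range(500)]
same = [rng.gauss(0.0, 1.0) for _ in range(500)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(500)]

alpha = 0.05  # assumed significance level
p_same = permutation_p_value(reference, same)
p_shifted = permutation_p_value(reference, shifted)
print("same distribution:", p_same, "alarm" if p_same < alpha else "no alarm")
print("mean shift:", p_shifted, "alarm" if p_shifted < alpha else "no alarm")
```

The detectors discussed in this article differ in their summaries, statistics, and thresholds, but they follow the same reject-or-retain logic.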
Overview of Drift Detection Algorithms
Menelaus, named for the Odyssean hero who defeated the shapeshifting Proteus, implements several data drift detection algorithms to equip ML practitioners and researchers with the tools to monitor their systems and identify drift [2]. The MITRE Corporation developed this package to advance public access to drift detection tools. MITRE is a federally funded R&D center that works with government and industry to solve problems for a safer world.
Menelaus implements four data drift detection algorithms:
- Hellinger Distance Drift Detection Method (HDDDM) [3]
- Confidence Distribution Batch Detection (CDBD) [4]
- kdq-Tree [5]
- Principal Component Analysis Change Detection (PCA-CD) [6]
These algorithms do not require access to a predictive model or labels. HDDDM and CDBD can operate with batches of data, while PCA-CD is suited to work with streaming data. kdq-Tree can operate in either context. Further detail on these algorithms can be found at the end of the article.
Data Drift Example
This example uses Menelaus version 0.2.0.
Below is a walk-through of data drift detection using Menelaus’ implementation of HDDDM. This example will use synthetic data created for this package. The data is accessible via the Menelaus library, as shown below. It spans years 2007–2021, with each year containing 20,000 observations. It contains variables A-J, a categorical variable ‘cat’, and ‘conf’ containing simulated confidence scores.
Drift is introduced into the data at the following time points.
- Drift 1 occurs in 2009: B changes in mean.
- Drift 2 occurs in 2012: D changes in variance.
- Drift 3 occurs in 2015: E and F become more strongly correlated, while their means and variances remain the same.
- Drift 4 occurs from 2018–2021: H changes in mean and variance, and the range of confidence scores changes from [0, 0.6] to [0.4, 1].
- Drift 5 occurs in 2021: J changes in mean and variance.
Except for drift 4, each drift affects only a single year of data.
Read in Data
from menelaus.datasets import make_example_batch_data
import pandas as pd
import numpy as np

df = make_example_batch_data()
Apply the Data Detector
HDDDM initializes with a reference batch and is updated with later test batches, each of which is compared to the reference distribution. If drift is not detected, the reference batch is expanded to include the most recent test batch, and the estimates of the reference distribution are updated. If drift is detected, the most recent test batch, which contains the new distributions, becomes the new reference batch. All subsequent test batches are compared against this new reference.
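The update rule described above can be sketched with a toy detector. This is not HDDDM itself: the class name ReferenceWindow, the fixed threshold, and the placeholder statistic (absolute difference in batch means) are assumptions chosen to isolate the reference-handling logic. HDDDM uses Hellinger distances between histograms and an adaptive threshold instead.

```python
from statistics import mean

class ReferenceWindow:
    """Toy batch detector illustrating only the reference-update rule."""

    def __init__(self, threshold):
        self.threshold = threshold  # fixed here; HDDDM adapts its threshold
        self.reference = []
        self.drift_state = None

    def set_reference(self, batch):
        self.reference = list(batch)
        self.drift_state = None

    def update(self, test_batch):
        # Placeholder statistic: absolute difference in batch means.
        statistic = abs(mean(test_batch) - mean(self.reference))
        if statistic > self.threshold:
            # Drift: the test batch becomes the new reference.
            self.drift_state = "drift"
            self.reference = list(test_batch)
        else:
            # No drift: fold the test batch into the reference.
            self.drift_state = None
            self.reference.extend(test_batch)
        return statistic

detector = ReferenceWindow(threshold=0.5)
detector.set_reference([0.1, -0.2, 0.0, 0.1])

# Made-up batches: a one-year mean shift that reverts the following year.
batches = {
    2008: [0.0, 0.2, -0.1, 0.1],
    2009: [2.0, 2.1, 1.9, 2.2],
    2010: [0.0, 0.1, -0.1, 0.2],
}
history = {}
for year, batch in batches.items():
    detector.update(batch)
    history[year] = (detector.drift_state, len(detector.reference))
    print(year, history[year])
```

Because the drifted batch replaces the reference, a drift that lasts a single batch can raise two alarms under this rule: one when it appears and one when the data revert.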
As HDDDM is a batch-based method, we will prepare the data by specifying 2007 as the reference batch. All subsequent years will be considered test batches.
from menelaus.data_drift.hdddm import HDDDM

# Set up reference and test batches, using 2007 as reference year
# -2 indexing removes columns "drift" and "confidence"
reference = df[df.year == 2007].iloc[:, 1:-2]
all_test = df[df.year != 2007]

# Setup HDDDM
np.random.seed(1)
hdddm = HDDDM(subsets=8)

# Store epsilons per feature for heatmap
years = all_test.year.unique()
heatmap_data = pd.DataFrame(columns = years)

# Store drift for test statistic plot
detected_drift = []

# Run HDDDM
hdddm.set_reference(reference)
for year, subset_data in df[df.year != 2007].groupby("year"):
    hdddm.update(subset_data.iloc[:, 1:-2])
    heatmap_data[year] = hdddm.feature_epsilons
    detected_drift.append(hdddm.drift_state)
Visualize Results
The first plot we create is a line plot of the test statistic, the difference in Hellinger distances between reference and test batches. A test statistic of 0 indicates no difference. Negative test statistics indicate the test batch is more like the reference batch than the previous test batch. Positive test statistics indicate drift: the current test batch is significantly different from the reference batch.
HDDDM identifies drift in 2009, 2010, 2012, 2019, and 2021. All of these drifts involve a change in mean or variance, suggesting that HDDDM is well suited to recognizing these types of changes. The 2010 alarm likely occurs because the distribution of B reverts to its state before the 2009 drift. The 2015 drift, a change in correlation, goes undetected. The drift that begins in 2018 is detected in 2019, one year late.
import seaborn as sns
import matplotlib.pyplot as plt

h_distances = [ep - th for ep, th in zip(hdddm.epsilon_values.values(), hdddm.thresholds.values())]

# Plot Hellinger Distance against Year, along with detected drift
plot_data = pd.DataFrame(
    {"Year": years, "Hellinger Distance": h_distances,
     "Detected Drift": detected_drift}
)

sns.set_style("white")
plt.figure(figsize=(16, 9))
plt.plot(
    "Year", "Hellinger Distance", data=plot_data,
    label="Hellinger Distance", marker="."
)

plt.grid(False, axis="x")
plt.xticks(years, fontsize=16)
plt.yticks(fontsize=16)
plt.title("HDDDM Test Statistics", fontsize=22)
plt.ylabel("Hellinger Distance", fontsize=22)
plt.xlabel("Year", fontsize=22)
plt.ylim([min(h_distances) - 0.02, max(h_distances) + 0.02])

for _, t in enumerate(plot_data.loc[plot_data["Detected Drift"] == "drift"]["Year"]):
    plt.axvspan(
        t - 0.2, t + 0.2, alpha=0.5, color="red",
        label=("Drift Detected" if _ == 0 else None)
    )

plt.legend(fontsize='x-large')
plt.axhline(y=0, color="orange", linestyle="dashed")
The test statistics stored by HDDDM can be used to produce a feature heatmap. This heatmap is useful to visualize where drift is occurring.
The following behavior is shown:
- Drift in B is detected in 2009 and 2010 (as it reverts to its initial distribution).
- Drift in D is detected in 2012 and 2013 (as it reverts to its initial distribution).
- Drift in H is detected in 2019.
- Drift in J is detected in 2021.
- The 2015 drift in the correlation between E and F is not detected.
sns.set_style("whitegrid")
sns.set(rc={"figure.figsize": (16, 9)})

# Setup plot
grid_kws = {"height_ratios": (0.9, 0.05), "hspace": 0.3}
f, (ax, cbar_ax) = plt.subplots(2, gridspec_kw=grid_kws)
coloring = sns.cubehelix_palette(start=0.8, rot=-0.5, as_cmap=True)
ax = sns.heatmap(
    heatmap_data,
    ax=ax,
    cmap=coloring,
    xticklabels=heatmap_data.columns,
    yticklabels=df.columns[1:12].tolist(),
    linewidths=0.5,
    cbar_ax=cbar_ax,
    cbar_kws={"orientation": "horizontal"},
)

ax.set_xticklabels(ax.get_xticklabels(), fontsize=17)
ax.set_title('HDDDM Feature Heatmap', fontsize=22)
ax.set_xlabel("Years", fontsize=22)
ax.set_ylabel("Features", fontsize=22)
ax.set_yticklabels(ax.get_yticklabels(), fontsize=17)
ax.collections[0].colorbar.set_label("Difference in Hellinger Distance", fontsize=22)
ax.collections[0].colorbar.set_ticklabels(ax.get_xticks(), fontsize=17)
ax.set_yticklabels(ax.get_yticklabels(), rotation=0)

plt.show()
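The missed 2015 drift is consistent with how per-feature histograms behave: a change in correlation can leave every marginal distribution untouched. The sketch below illustrates this with made-up data rather than the Menelaus dataset. Two features keep standard normal marginals while their correlation rises from 0.2 to 0.9, and a mean shift is included for contrast. The helper names, sample sizes, and bin settings are assumptions.

```python
import math
import random

def correlated_pairs(n, rho, rng):
    """Draw n pairs with standard normal marginals and correlation rho."""
    pairs = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        pairs.append((z1, rho * z1 + math.sqrt(1 - rho ** 2) * z2))
    return pairs

def hellinger(x, y, bins=20, lo=-4.0, hi=4.0):
    """Hellinger distance between fixed-range histograms of x and y."""
    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / (hi - lo) * bins)
            counts[max(0, min(idx, bins - 1))] += 1  # clamp outliers
        return [c / len(values) for c in counts]
    p, q = proportions(x), proportions(y)
    return math.sqrt(
        0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    )

def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(0)
before = correlated_pairs(5000, 0.2, rng)
after = correlated_pairs(5000, 0.9, rng)
shifted = [(e + 1.0, f) for e, f in correlated_pairs(5000, 0.2, rng)]

e_before, f_before = zip(*before)
e_after, f_after = zip(*after)
e_shifted, _ = zip(*shifted)

h_e = hellinger(e_before, e_after)        # marginal of E, correlation drift
h_f = hellinger(f_before, f_after)        # marginal of F, correlation drift
h_shift = hellinger(e_before, e_shifted)  # marginal of E, mean shift
r_before = pearson(e_before, f_before)
r_after = pearson(e_after, f_after)

print("per-feature distance under correlation drift:", h_e, h_f)
print("per-feature distance under mean shift:", h_shift)
print("correlation before and after:", r_before, r_after)
```

Detecting this kind of change calls for a method that looks at the joint distribution, such as kdq-Tree or PCA-CD.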
Limitations and Future Research
Data drift detectors offer the ability to understand when, and potentially where, drift occurs. If you are interested in using data drift detectors primarily to monitor model performance, these detectors avoid the need for labeled data for classification, but come with a higher risk of false alarms. To reliably monitor model performance, it may be necessary to complement or verify the results using a supervised or semi-supervised concept drift algorithm.
As with concept drift detection algorithms, a limitation of data drift detection algorithms is that the true occurrence of drift is rarely known in real data. It can be challenging to determine the best combination of settings within a given environment. To identify the optimal parameters, it is necessary to bring in domain knowledge to consider drift alarms within the context of the use case.
Resources to Learn More:
Find our package at the following links. Our team welcomes any feedback and opportunities for collaboration.
To gain a deeper understanding of data drift and the techniques used by a variety of algorithms to detect it, we recommend reading the 2020 survey paper by Gemaque et al., “An overview of unsupervised drift detection methods” [1].
To learn more about the MITRE Corporation, visit the website here.
Additional Details on Algorithms in Menelaus:
HDDDM
HDDDM estimates the distribution of features in a dataset using histograms and monitors the Hellinger distance between reference and test batches. When this distance exceeds an adaptive threshold, the user is alerted to drift. HDDDM alerts users to specific features exhibiting the most significant drifts [3].
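One way to picture an adaptive threshold is as a bound that tracks the typical size of recent changes in distance. The sketch below flags a step when the change in distance exceeds the mean plus gamma standard deviations of the earlier absolute changes. It is a simplified stand-in with made-up distances: the function name, the gamma and warmup parameters, and the choice to keep alarm-sized changes out of the baseline are all assumptions, and the exact rule in [3] and in Menelaus may differ.

```python
from statistics import mean, pstdev

def adaptive_alarms(distances, gamma=1.5, warmup=3):
    """Return the steps whose change in distance exceeds mean + gamma * std.

    distances: one reference-vs-test distance per batch, in time order.
    """
    alarms = []
    history = []  # absolute changes that did not trigger an alarm
    for step in range(1, len(distances)):
        epsilon = distances[step] - distances[step - 1]
        if len(history) >= warmup:
            threshold = mean(history) + gamma * pstdev(history)
            if abs(epsilon) > threshold:
                alarms.append(step)
                continue  # keep alarm-sized changes out of the baseline
        history.append(abs(epsilon))
    return alarms

# Made-up distances: stable, one sharp jump, then a return to normal.
distances = [0.10, 0.11, 0.10, 0.12, 0.11, 0.45, 0.12, 0.11]
alarms = adaptive_alarms(distances)
print("alarms at steps:", alarms)
```

Because the bound is computed from the data seen so far, the same jump may or may not trigger an alarm depending on how noisy the earlier batches were.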
CDBD
CDBD uses histograms to estimate the distribution of a single feature in a dataset and monitors the Kullback-Leibler divergence between reference and test batches. Intended to monitor the confidence scores of a classifier, CDBD can detect univariate drift in any feature within a dataset or a performance metric, if available [4].
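The core computation, a Kullback-Leibler divergence between two histograms of one feature, can be sketched as follows. The data are synthetic confidence scores loosely modeled on drift 4 from the example (a range of [0, 0.6] moving to [0.4, 1]). The function names, bin count, and additive smoothing constant are assumptions for illustration and are not taken from Menelaus or from [4].

```python
import math
import random

def histogram(values, bins=10, lo=0.0, hi=1.0):
    """Count values per equal-width bin over [lo, hi]."""
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        counts[idx] += 1
    return counts

def kl_divergence(p_counts, q_counts, smoothing=0.5):
    """KL(P || Q) between two histograms.

    Additive smoothing keeps empty bins from producing log(0)
    or a division by zero.
    """
    bins = len(p_counts)
    p_total = sum(p_counts) + smoothing * bins
    q_total = sum(q_counts) + smoothing * bins
    divergence = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + smoothing) / p_total
        q = (qc + smoothing) / q_total
        divergence += p * math.log(p / q)
    return divergence

rng = random.Random(7)
reference = [rng.uniform(0.0, 0.6) for _ in range(1000)]
test_same = [rng.uniform(0.0, 0.6) for _ in range(1000)]
test_shifted = [rng.uniform(0.4, 1.0) for _ in range(1000)]

ref_hist = histogram(reference)
kl_same = kl_divergence(ref_hist, histogram(test_same))
kl_shifted = kl_divergence(ref_hist, histogram(test_shifted))
print("same range:", kl_same)
print("shifted range:", kl_shifted)
```

A detector would compare this divergence against a threshold. How that threshold is chosen is left out of the sketch.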
kdq-Tree
kdq-Tree partitions the reference batch into a k-d-quad-tree, then partitions the test batch using the same tree. This defines two empirical distributions over the tree, which are then used to compute the Kullback-Leibler divergence between the batches. The user is alerted to drift when this divergence exceeds a threshold. kdq-Tree can identify joint-distributional drift, alerting the user to regions of the dataspace exhibiting the largest changes. The tree partitioning the space can be visualized to highlight the regions which have changed the most [5].
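A heavily simplified two-dimensional sketch of this idea is shown below. Cells are halved at their midpoint, cycling through the axes, until a cell holds few reference points or becomes too narrow. Both batches are then counted over the resulting leaves and compared with a smoothed Kullback-Leibler divergence. The function names, stopping parameters, smoothing constant, and synthetic data are assumptions, and the procedure for choosing an alarm threshold is omitted entirely.

```python
import math
import random

def build_leaves(points, bounds, depth=0, min_points=50, min_side=0.05):
    """Recursively halve cells, cycling through axes; return leaf boxes."""
    axis = depth % len(bounds)
    lo, hi = bounds[axis]
    if len(points) < min_points or (hi - lo) / 2 < min_side:
        return [bounds]
    mid = (lo + hi) / 2
    left = [p for p in points if p[axis] < mid]
    right = [p for p in points if p[axis] >= mid]
    left_bounds = bounds[:axis] + [(lo, mid)] + bounds[axis + 1:]
    right_bounds = bounds[:axis] + [(mid, hi)] + bounds[axis + 1:]
    return (build_leaves(left, left_bounds, depth + 1, min_points, min_side)
            + build_leaves(right, right_bounds, depth + 1, min_points, min_side))

def leaf_counts(points, leaves):
    """Count how many points fall in each half-open leaf box."""
    counts = [0] * len(leaves)
    for p in points:
        for i, box in enumerate(leaves):
            if all(lo <= x < hi for x, (lo, hi) in zip(p, box)):
                counts[i] += 1
                break
    return counts

def proportions(counts, smoothing=0.5):
    """Smoothed leaf proportions, so empty leaves stay strictly positive."""
    total = sum(counts) + smoothing * len(counts)
    return [(c + smoothing) / total for c in counts]

def kl_divergence(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q))

rng = random.Random(1)
unit_square = [(0.0, 1.0), (0.0, 1.0)]
reference = [(rng.random(), rng.random()) for _ in range(1000)]
test_same = [(rng.random(), rng.random()) for _ in range(1000)]
# Drifted batch: all points concentrate in the lower-left quadrant.
test_drift = [(rng.random() * 0.5, rng.random() * 0.5) for _ in range(1000)]

leaves = build_leaves(reference, unit_square)
p_ref = proportions(leaf_counts(reference, leaves))
p_same = proportions(leaf_counts(test_same, leaves))
p_drift = proportions(leaf_counts(test_drift, leaves))

kl_same = kl_divergence(p_ref, p_same)
kl_drift = kl_divergence(p_ref, p_drift)
# Leaf whose share of the data changed the most between the batches.
worst = max(range(len(leaves)), key=lambda i: abs(p_ref[i] - p_drift[i]))

print("leaves:", len(leaves))
print("divergence, same distribution:", kl_same)
print("divergence, drifted:", kl_drift)
print("most changed region:", leaves[worst])
```

Because the leaves are boxes in the original feature space, the leaf with the largest change points directly to a region of the data, which is what makes the partition useful for visualization.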
PCA-CD
PCA-CD uses principal component analysis to summarize the features in the reference and test window and calculates a divergence metric between these components. This metric is monitored dynamically through use of the Page-Hinkley test. If it exceeds a certain threshold, drift is detected. PCA-CD can identify joint-distributional drift [6].
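The monitoring step can be illustrated with the Page-Hinkley test in its common one-sided form, which accumulates deviations above the running mean and raises an alarm when the accumulated value climbs far enough above its historical minimum. The sketch below applies it to a made-up series of divergence values. The function name and the delta and threshold settings are assumptions, and this is not the Menelaus implementation.

```python
def page_hinkley(stream, delta=0.005, threshold=1.0):
    """Return the index of the first detected increase in mean, or None.

    delta: tolerance for small fluctuations around the running mean.
    threshold: how far the cumulative sum must rise above its minimum.
    """
    running_mean = 0.0
    cumulative = 0.0
    minimum = 0.0
    for i, x in enumerate(stream, start=1):
        running_mean += (x - running_mean) / i  # incremental mean
        cumulative += x - running_mean - delta
        minimum = min(minimum, cumulative)
        if cumulative - minimum > threshold:
            return i - 1  # zero-based index of the alarm
    return None

# Made-up divergence values: stable at 0.1, then jumping to 0.6.
stable = [0.1] * 100
drifting = [0.1] * 50 + [0.6] * 50

detected_stable = page_hinkley(stable)
detected_drifting = page_hinkley(drifting)
print("stable stream:", detected_stable)
print("drifting stream:", detected_drifting)
```

Raising the threshold reduces false alarms at the cost of a longer detection delay, and delta controls how much ordinary fluctuation is ignored.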
References:
[1] R. N. Gemaque, A. F. J. Costa, R. Giusti, and E. M. Dos Santos, “An overview of unsupervised drift detection methods,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 10, no. 6, p. e1381, 2020.
[2] L. Nicholl, T. Schill, I. Lindsay, A. Srivastava, K. P. McNamara, and S. Jarmale, Menelaus. The MITRE Corporation, 2022. [Online]. Available: https://github.com/mitre/menelaus
[3] G. Ditzler and R. Polikar, “Hellinger distance based drift detection for nonstationary environments,” in 2011 IEEE Symposium on Computational Intelligence in Dynamic and Uncertain Environments (CIDUE), 2011, pp. 41–48. doi: 10.1109/CIDUE.2011.5948491.
[4] P. Lindstrom, B. Mac Namee, and S. J. Delany, “Drift detection using uncertainty distribution divergence,” Evolving Systems, vol. 4, no. 1, pp. 13–25, 2013.
[5] T. Dasu, S. Krishnan, S. Venkatasubramanian, and K. Yi, “An information-theoretic approach to detecting changes in multi-dimensional data streams,” 2006.
[6] A. A. Qahtan, B. Alharbi, S. Wang, and X. Zhang, “A PCA-Based Change Detection Framework for Multidimensional Data Streams: Change Detection in Multidimensional Data Streams,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, pp. 935–944. doi: 10.1145/2783258.2783359.
Approved for Public Release; Distribution Unlimited. Public Release Case Number 22–3165 ©2022 The MITRE Corporation. ALL RIGHTS RESERVED.