.. _drift:

Drift Detection
===============

This page demonstrates how to use tab-right for detecting and visualizing data drift between reference and current datasets. Tab-right provides comprehensive tools for drift detection that help you identify changes in data distributions.

Drift Detection with tab-right
------------------------------

Tab-right offers specialized components for drift detection:

1. ``DriftCalculator`` - Core class for calculating drift between datasets
2. ``DriftPlotter`` - Visualization class for creating plots with both matplotlib and plotly backends
3. ``univariate`` module - Lower-level functions for specific drift calculations

Available Drift Metrics
-----------------------

Tab-right provides multiple metrics for different types of features:

**Numerical Features:**
- **Wasserstein Distance** (default): Measures the earth mover's distance between distributions
- **Kolmogorov-Smirnov Test**: Statistical test for equality of continuous distributions

**Categorical Features:**
- **Cramer's V** (default): Normalized measure of association between categorical variables
- **Chi-Square Test**: Statistical test for independence of categorical variables

Example: Using DriftCalculator and DriftPlotter
-----------------------------------------------

The most concise way to analyze and visualize drift with tab-right is to use the ``DriftCalculator`` and ``DriftPlotter`` classes:

.. plot::
    :include-source:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from tab_right.drift.drift_calculator import DriftCalculator
    from tab_right.plotting.drift_plotter import DriftPlotter

    # Generate simple dataset for demo
    np.random.seed(42)
    df1 = pd.DataFrame({
        'numeric': np.random.normal(0, 1, 100),
        'category': np.random.choice(['A', 'B', 'C'], 100, p=[0.5, 0.3, 0.2])
    })

    df2 = pd.DataFrame({
        'numeric': np.random.normal(1, 1.2, 120),  # Shift in distribution
        'category': np.random.choice(['A', 'B', 'C'], 120, p=[0.2, 0.3, 0.5])  # Different proportions
    })

    # Create the drift calculator
    drift_calc = DriftCalculator(df1, df2)

    # Create the plotter
    plotter = DriftPlotter(drift_calc)

    # Plot summary of drift across features
    fig = plotter.plot_multiple()
    plt.tight_layout()
    plt.show()

Feature-Level Distribution Comparison
-------------------------------------

You can also examine the distribution shifts for individual features:

.. plot::
    :include-source:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from tab_right.drift.drift_calculator import DriftCalculator
    from tab_right.plotting.drift_plotter import DriftPlotter

    # Generate datasets with drift
    np.random.seed(42)
    df1 = pd.DataFrame({
        'numeric': np.random.normal(0, 1, 100),
        'category': np.random.choice(['A', 'B', 'C'], 100, p=[0.5, 0.3, 0.2])
    })

    df2 = pd.DataFrame({
        'numeric': np.random.normal(1, 1.2, 120),
        'category': np.random.choice(['A', 'B', 'C'], 120, p=[0.2, 0.3, 0.5])
    })

    # Create calculator and plotter
    drift_calc = DriftCalculator(df1, df2)
    plotter = DriftPlotter(drift_calc)

    # Plot numerical feature distribution comparison
    fig_numeric = plotter.plot_single('numeric')
    plt.tight_layout()
    plt.show()

Categorical Feature Visualization
---------------------------------

Tab-right also makes it easy to visualize categorical feature drift:

.. plot::
    :include-source:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from tab_right.drift.drift_calculator import DriftCalculator
    from tab_right.plotting.drift_plotter import DriftPlotter

    # Generate datasets with categorical drift
    np.random.seed(42)
    df1 = pd.DataFrame({
        'numeric': np.random.normal(0, 1, 100),
        'category': np.random.choice(['A', 'B', 'C'], 100, p=[0.5, 0.3, 0.2])
    })

    df2 = pd.DataFrame({
        'numeric': np.random.normal(1, 1.2, 120),
        'category': np.random.choice(['A', 'B', 'C'], 120, p=[0.2, 0.3, 0.5])
    })

    # Create calculator and plotter
    drift_calc = DriftCalculator(df1, df2)
    plotter = DriftPlotter(drift_calc)

    # Plot categorical feature distribution comparison
    fig_cat = plotter.plot_single('category')
    plt.tight_layout()
    plt.show()

Direct Functions API
--------------------

For simpler use cases, tab-right also provides direct functions for drift analysis:

.. plot::
    :include-source:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from tab_right.drift import univariate
    from tab_right.plotting import DriftPlotter

    # Generate datasets
    np.random.seed(42)
    df_ref = pd.DataFrame({
        'num_feature': np.random.normal(0, 1, 500),
        'cat_feature': np.random.choice(['A', 'B', 'C'], 500)
    })

    df_cur = pd.DataFrame({
        'num_feature': np.random.normal(0.3, 1.2, 500),
        'cat_feature': np.random.choice(['A', 'B', 'C'], 500, p=[0.2, 0.5, 0.3])
    })

    # Calculate drift across all features
    result = univariate.detect_univariate_drift_df(df_ref, df_cur)

    # Plot the results using DriftPlotter
    fig = DriftPlotter.plot_drift_mp(None, result)
    plt.tight_layout()
    plt.show()

Working with Multiple Drift Metrics
-----------------------------------

Tab-right supports various drift metrics that can be customized:

.. plot::
    :include-source:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from tab_right.drift import univariate
    from tab_right.drift.drift_calculator import DriftCalculator
    from tab_right.plotting.drift_plotter import DriftPlotter

    # Generate data
    np.random.seed(42)
    df_ref = pd.DataFrame({
        'feat1': np.random.normal(0, 1, 500),
        'feat2': np.random.choice(['A', 'B', 'C'], 500),
    })

    df_cur = pd.DataFrame({
        'feat1': np.random.normal(0.5, 1.5, 500),
        'feat2': np.random.choice(['A', 'B', 'C'], 500, p=[0.5, 0.3, 0.2]),
    })

    # Using DriftCalculator with default metrics
    calc = DriftCalculator(df_ref, df_cur)

    # Create a plotter
    plotter = DriftPlotter(calc)

    # Plot the results
    fig = plotter.plot_multiple()
    plt.title('Drift Analysis with Default Metrics')
    plt.tight_layout()
    plt.show()

Visualizing Different Types of Drift
------------------------------------

Let's look at how different degrees of drift appear in tab-right visualizations:

.. plot::
    :include-source:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from tab_right.drift.drift_calculator import DriftCalculator
    from tab_right.plotting.drift_plotter import DriftPlotter

    # Create datasets with increasing levels of drift
    np.random.seed(42)
    ref_data = np.random.normal(0, 1, 500)

    # Create three datasets with different levels of drift
    slight_drift = np.random.normal(0.2, 1.1, 500)  # slight drift
    moderate_drift = np.random.normal(0.5, 1.3, 500)  # moderate drift
    severe_drift = np.random.normal(2.0, 1.8, 500)  # severe drift

    # Create a figure with 3 subplots
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))

    # Set up titles
    titles = ['Slight Drift', 'Moderate Drift', 'Severe Drift']
    drift_data = [slight_drift, moderate_drift, severe_drift]

    # Create and plot each dataset using tab_right
    for i, current_data in enumerate(drift_data):
        # Create DataFrames
        df_ref = pd.DataFrame({'value': ref_data})
        df_cur = pd.DataFrame({'value': current_data})

        # Calculate drift
        drift_calc = DriftCalculator(df_ref, df_cur)
        drift_result = drift_calc()
        drift_score = round(drift_result.iloc[0]['score'], 3)

        # Create plotter
        plotter = DriftPlotter(drift_calc)

        # Plot distribution on the corresponding subplot
        dist_fig = plotter.plot_single('value')

        # Remove the original figure and copy its content to our subplot
        for line in dist_fig.axes[0].lines:
            axes[i].plot(line.get_xdata(), line.get_ydata(),
                         color=line.get_color(), label=line.get_label())

        # Set title with drift score
        axes[i].set_title(f"{titles[i]}\nDrift Score: {drift_score}")
        axes[i].legend()

        # Close the original figure to prevent display
        plt.close(dist_fig)

    plt.tight_layout()
    plt.show()

Key Features of tab-right's Drift Detection
-------------------------------------------

Tab-right offers comprehensive drift detection capabilities:

- **Flexible API**: Choose between object-oriented (DriftCalculator/DriftPlotter) or functional approaches
- **Automatic feature type detection**: Appropriate metrics are selected based on the data type
- **Multiple drift metrics**: Including Wasserstein distance, KS test, and Cramer's V
- **Matplotlib integration**: Create publication-ready plots with built-in matplotlib figures
- **Multi-feature analysis**: Analyze drift across all features at once
- **Probability density comparison**: Examine detailed distribution changes

These tools make it easy to track and analyze distribution shifts in your data, helping you maintain model performance over time.