.. _seg_double:

Double Segmentation
===================

This page demonstrates how to use tab-right's double segmentation functionality to analyze model performance across combinations of two features.

What is Double Segmentation?
----------------------------

Double segmentation splits your data by two features at once and computes a performance metric for each combination of segments. This is useful for:

- Identifying feature interactions affecting model performance
- Finding specific feature value combinations where your model underperforms
- Understanding complex patterns that single-feature analysis might miss
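
Conceptually, this is a two-key group-by: bin each feature, group rows by the pair of bins, and score each group. The sketch below illustrates that idea with plain pandas and NumPy. It is not tab-right's implementation; the quantile binning, the column names, and the ``mse_grid`` variable are assumptions made for the illustration.

.. code-block:: python

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    n = 1_000
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)

    # The "model" ignores the interaction term present in the target
    y_true = x1 + x2 + 2 * x1 * x2
    y_pred = x1 + x2

    frame = pd.DataFrame({"x1": x1, "x2": x2, "sq_err": (y_true - y_pred) ** 2})

    # Assumption: three quantile bins per feature (tab-right may bin differently)
    labels = ["low", "mid", "high"]
    frame["x1_bin"] = pd.qcut(frame["x1"], q=3, labels=labels)
    frame["x2_bin"] = pd.qcut(frame["x2"], q=3, labels=labels)

    # Mean squared error per (x1_bin, x2_bin) pair, reshaped into a 3x3 grid
    mse_grid = (
        frame.groupby(["x1_bin", "x2_bin"], observed=True)["sq_err"]
        .mean()
        .unstack("x2_bin")
    )
    print(mse_grid.round(2))

In this toy setup the error grows with the product of the two features, so the corner cells of the grid score worse than the centre cell, which is the kind of pattern a single-feature breakdown would average away.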

Tab-right's Double Segmentation Tools
-------------------------------------

Tab-right provides these tools for double segmentation analysis:

1. ``DoubleSegmentationImp`` - Main class for performing double segmentation
2. ``DoubleSegmPlotting`` - Visualization with support for both interactive Plotly and static Matplotlib backends

Basic Usage with Continuous Features
------------------------------------

The example below builds a target with an interaction term that the prediction ignores, so the error should concentrate in segments where both features are far from zero:

.. plot::
    :include-source:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.metrics import mean_squared_error
    from tab_right.segmentations import DoubleSegmentationImp
    from tab_right.plotting import DoubleSegmPlotting

    # Create sample data
    np.random.seed(42)
    n_samples = 500

    # Generate features and target
    feature1 = np.random.normal(0, 1, n_samples)
    feature2 = np.random.normal(0, 1, n_samples)

    # Target with interaction effect
    target = 2 + feature1 + feature2 + 2 * (feature1 * feature2) + np.random.normal(0, 1, n_samples)

    # Prediction missing the interaction term
    prediction = 2 + feature1 + feature2 + np.random.normal(0, 1, n_samples)

    # Create DataFrame
    df = pd.DataFrame({
        'feature1': feature1,
        'feature2': feature2,
        'target': target,
        'prediction': prediction
    })

    # Perform double segmentation
    double_seg = DoubleSegmentationImp(
        df=df,
        label_col='target',
        prediction_col='prediction'
    )

    # Apply segmentation with 3 bins for each feature
    result_df = double_seg(
        feature1_col='feature1',
        feature2_col='feature2',
        score_metric=mean_squared_error,
        bins_1=3,
        bins_2=3
    )

    # Visualize results with a heatmap
    plotter = DoubleSegmPlotting(df=result_df, backend="matplotlib")
    fig = plotter.plot_heatmap()
    plt.title("MSE by Feature1 and Feature2 Segments")

Working with Categorical Features
---------------------------------

Double segmentation works with categorical features without needing to specify bins:

.. plot::
    :include-source:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.metrics import accuracy_score
    from tab_right.segmentations import DoubleSegmentationImp
    from tab_right.plotting import DoubleSegmPlotting

    # Create sample categorical data
    np.random.seed(42)
    n = 800

    # Generate categorical features with non-uniform distributions
    category1 = np.random.choice(
        ['A', 'B', 'C', 'D'],
        n,
        p=[0.4, 0.3, 0.2, 0.1]  # Different probabilities for each category
    )
    category2 = np.random.choice(
        ['X', 'Y', 'Z'],
        n,
        p=[0.5, 0.3, 0.2]
    )

    # Generate target with different patterns for combinations
    target = np.zeros(n, dtype=int)

    # Add different effects for different combinations
    target[(category1 == 'A') & (category2 == 'X')] = 1
    target[(category1 == 'B') & (category2 == 'Y')] = 1
    target[(category1 == 'C') & (category2 == 'Z')] = 1
    # Special case: D/Z rows are positive with probability 0.8 rather than always
    dz_mask = (category1 == 'D') & (category2 == 'Z')
    target[dz_mask] = np.random.binomial(1, 0.8, dz_mask.sum())

    # Add some noise
    noise_mask = np.random.choice([True, False], n, p=[0.1, 0.9])
    target[noise_mask] = 1 - target[noise_mask]

    # Simple prediction without capturing all patterns
    prediction = np.zeros(n, dtype=int)
    prediction[category1 == 'A'] = 1
    prediction[category2 == 'Z'] = 1

    # Create DataFrame
    cat_df = pd.DataFrame({
        'category1': category1,
        'category2': category2,
        'target': target,
        'prediction': prediction
    })

    # Perform double segmentation
    cat_seg = DoubleSegmentationImp(
        df=cat_df,
        label_col='target',
        prediction_col='prediction'
    )

    # Apply segmentation (no bins needed for categorical features)
    cat_results = cat_seg(
        feature1_col='category1',
        feature2_col='category2',
        score_metric=accuracy_score
    )

    # Accuracy is a higher-is-better metric, so set lower_is_better=False
    cat_plot = DoubleSegmPlotting(
        df=cat_results,
        lower_is_better=False,
        backend="matplotlib"
    )
    fig = cat_plot.plot_heatmap()
    plt.title("Accuracy by Category Segments")
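
Because the categories above are drawn with non-uniform probabilities, some combinations contain few rows (``D`` with ``Z`` is expected to have about 0.1 × 0.2 × 800 = 16), and a score computed on so few rows is noisy. A quick way to check segment sizes before reading the heatmap is sketched below with plain pandas. The ``min_rows`` threshold is an arbitrary value chosen for illustration, and the data is regenerated with a different random generator than the example above, so the exact counts will differ.

.. code-block:: python

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)
    n = 800
    demo = pd.DataFrame({
        "category1": rng.choice(["A", "B", "C", "D"], n, p=[0.4, 0.3, 0.2, 0.1]),
        "category2": rng.choice(["X", "Y", "Z"], n, p=[0.5, 0.3, 0.2]),
    })

    # Number of rows in each (category1, category2) combination
    sizes = demo.groupby(["category1", "category2"]).size()
    counts = sizes.unstack(fill_value=0)
    print(counts)

    # Flag combinations with too few rows to trust the score (illustrative threshold)
    min_rows = 30
    small = sizes[sizes < min_rows]
    print(small)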

Mixed Categorical and Continuous Features
-----------------------------------------

Double segmentation can analyze combinations of categorical and continuous features:

.. plot::
    :include-source:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.metrics import f1_score
    from tab_right.segmentations import DoubleSegmentationImp
    from tab_right.plotting import DoubleSegmPlotting

    # Create sample data with mixed feature types
    np.random.seed(42)
    n_samples = 500

    # Generate categorical feature - product type
    product_types = ['Basic', 'Standard', 'Premium', 'Enterprise']
    product = np.random.choice(product_types, n_samples, p=[0.4, 0.3, 0.2, 0.1])

    # Generate continuous feature - customer spending
    spending = np.random.gamma(shape=5, scale=20, size=n_samples)

    # Add variation by product type
    spending[product == 'Premium'] *= 1.5
    spending[product == 'Enterprise'] *= 2.0

    # Ground truth: return probability is higher for premium-tier products and for high spenders
    premium_mask = np.logical_or(product == 'Premium', product == 'Enterprise')
    return_prob = 0.2 + 0.3 * premium_mask + 0.4 * (spending > np.percentile(spending, 70))
    return_prob = np.clip(return_prob, 0.1, 0.9)

    # Generate actual returns (target)
    customer_return = np.random.binomial(1, return_prob)

    # Simple prediction (missing some patterns)
    pred_prob = 0.2 + 0.4 * (product == 'Enterprise') + 0.3 * (spending > np.percentile(spending, 80))
    pred_prob = np.clip(pred_prob, 0.1, 0.9)
    prediction = np.random.binomial(1, pred_prob)

    # Create DataFrame
    mixed_df = pd.DataFrame({
        'product': product,
        'spending': spending,
        'target': customer_return,
        'prediction': prediction
    })

    # Perform double segmentation
    mixed_seg = DoubleSegmentationImp(
        df=mixed_df,
        label_col='target',
        prediction_col='prediction'
    )

    # Apply segmentation
    mixed_results = mixed_seg(
        feature1_col='product',
        feature2_col='spending',
        score_metric=f1_score,
        bins_2=4  # 4 bins for spending
    )

    # F1 is a higher-is-better metric, so set lower_is_better=False
    mixed_plot = DoubleSegmPlotting(
        df=mixed_results,
        lower_is_better=False,
        backend="matplotlib"
    )
    fig = mixed_plot.plot_heatmap()
    plt.title("F1 Score by Product Type and Spending")

Interactive Visualization with Plotly
-------------------------------------

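Tab-right also offers interactive Plotly visualization. The snippet below omits the ``backend`` argument and reuses ``result_df`` from the continuous-features example above: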

.. code-block:: python

    from tab_right.plotting import DoubleSegmPlotting

    # Create interactive visualization from the results
    interactive_plot = DoubleSegmPlotting(df=result_df)
    fig = interactive_plot.plot_heatmap()
    fig.update_layout(title="Interactive Double Segmentation Heatmap")
    fig.show()
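
The calls to ``update_layout`` and ``show`` above suggest that the returned object is a regular Plotly figure. Assuming that is the case, it can also be exported to a standalone HTML page for sharing. The sketch below uses a hand-built ``go.Heatmap`` as a stand-in for the tab-right figure; the grid values and axis labels are placeholders, not real results.

.. code-block:: python

    import plotly.graph_objects as go

    # Stand-in figure: placeholder 2x2 grid, not output from tab-right
    fig = go.Figure(
        go.Heatmap(
            z=[[0.0, 1.0], [1.0, 0.0]],
            x=["bin 1", "bin 2"],
            y=["bin 1", "bin 2"],
        )
    )
    fig.update_layout(title="Interactive Double Segmentation Heatmap")

    # Render to an HTML string that loads plotly.js from a CDN;
    # fig.write_html("heatmap.html") would write the same page to disk
    html = fig.to_html(include_plotlyjs="cdn")
    print(len(html))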

Using Different Metrics
-----------------------

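``score_metric`` accepts any scikit-learn-compatible metric function. The snippet below reuses ``double_seg`` from the continuous-features example above:

.. code-block:: python

    import matplotlib.pyplot as plt
    from sklearn.metrics import mean_absolute_error, r2_score
    from tab_right.plotting import DoubleSegmPlotting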

    # Using MAE instead of MSE
    mae_results = double_seg(
        feature1_col='feature1',
        feature2_col='feature2',
        score_metric=mean_absolute_error,
        bins_1=3,
        bins_2=3
    )

    # For metrics where higher is better (like R²)
    r2_results = double_seg(
        feature1_col='feature1',
        feature2_col='feature2',
        score_metric=r2_score,
        bins_1=3,
        bins_2=3
    )

    # Visualize with appropriate settings
    r2_plotter = DoubleSegmPlotting(df=r2_results, lower_is_better=False, backend="matplotlib")
    r2_plotter.plot_heatmap()
    plt.title("R² Score by Feature Segments")
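
Metrics that need extra arguments, or that scikit-learn does not ship, can be wrapped as small callables. The sketch below assumes that ``score_metric`` is invoked as ``metric(y_true, y_pred)`` and returns a number, which this page implies but does not state. ``safe_f1`` and ``mean_signed_error`` are hypothetical names introduced for the illustration.

.. code-block:: python

    from functools import partial

    import numpy as np
    from sklearn.metrics import f1_score

    # F1 that returns 0 instead of warning when a segment has no positives
    safe_f1 = partial(f1_score, zero_division=0)


    def mean_signed_error(y_true, y_pred):
        """Average of (prediction - label); positive values mean over-prediction."""
        return float(np.mean(np.asarray(y_pred) - np.asarray(y_true)))


    # Quick checks on toy arrays, independent of tab-right
    print(safe_f1([0, 0, 1], [0, 0, 0]))
    print(mean_signed_error([1.0, 2.0, 3.0], [1.5, 2.5, 3.5]))

Under that assumption, either callable could be passed as ``score_metric`` in place of ``mean_absolute_error``. A signed metric has no natural "lower is better" direction, so the heatmap colouring should be read with that in mind.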

Key Features of Double Segmentation
-----------------------------------

- **Discover interactions**: Find how combinations of features affect performance
- **Automatic handling**: Works with both numerical and categorical features
- **Flexible metrics**: Compatible with any scikit-learn metric
- **Visual insights**: Interactive and static visualization options
- **Performance diagnosis**: Quickly identify problem areas in your model

Double segmentation provides deeper insights than single-feature analysis, helping you better understand your model's behavior across different data segments.