Higgs to WW nanoAOD Analysis

A comprehensive $H \to WW$ analysis pipeline for the $e\mu$ final state in the ggH production channel, built on CMS Open Data.

I am developing an analysis pipeline for the Higgs boson decaying into two W bosons ($H \to WW$) in the Opposite-Sign, Different-Flavor ($e\mu$) final state. This project uses 2016 Ultra Legacy (UL) CMS Open Data to probe Standard Model physics while assessing how far LHC Open Data can empower researchers and students without full collaboration access.

Physics Pipeline: The $e\mu$ Channel

The $e\mu$ channel ($gg \to H \to WW \to e\nu\mu\nu$) is specifically chosen to eliminate massive backgrounds from $Z \to \ell\ell$ decays, providing a cleaner signal sample. Key aspects of the strategy include:

  • Kinematic Discrimination: Utilizing the Higgs Transverse Mass ($m_T^H$) as the primary discriminator.
  • Spin Correlation: Leveraging the small azimuthal opening angle ($\Delta\phi_{\ell\ell}$) characteristic of spin-0 Higgs decays to distinguish signal from continuum $WW$ production.
  • Background Suppression: Applying a b-jet veto to heavily suppress the dominant Top-quark ($t\bar{t}$) background and using $E_T^{miss}$ requirements to reject Drell-Yan processes.
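The two main discriminating variables above can be computed directly from event-level kinematics. Below is a minimal NumPy sketch of the azimuthal wrapping and the standard transverse-mass formula $m_T^H = \sqrt{2\, p_T^{\ell\ell}\, E_T^{miss}\, (1 - \cos\Delta\phi(\ell\ell, E_T^{miss}))}$; function and argument names are illustrative, not taken from the pipeline itself.

```python
import numpy as np

def delta_phi(phi1, phi2):
    """Azimuthal difference wrapped into [-pi, pi)."""
    dphi = phi1 - phi2
    return (dphi + np.pi) % (2.0 * np.pi) - np.pi

def higgs_transverse_mass(pt_ll, phi_ll, met, met_phi):
    """m_T^H = sqrt(2 pT^ll MET (1 - cos dphi(ll, MET)))."""
    dphi = delta_phi(phi_ll, met_phi)
    return np.sqrt(2.0 * pt_ll * met * (1.0 - np.cos(dphi)))

# Back-to-back dilepton system and MET (dphi = pi) maximizes m_T^H:
mt = higgs_transverse_mass(50.0, 0.0, 50.0, np.pi)  # -> 2*sqrt(50*50) = 100 GeV
```

Both functions are vectorized, so they apply unchanged to per-event NumPy or Awkward arrays.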

Analysis Strategy & Methodology

The pipeline follows a “Cut-and-Count” methodology implemented with a columnar approach:

  • Object Selection: Implementation of trigger requirements, “Tight” lepton identification, and kinematic cuts.
  • Categorization: Sorting events into orthogonal 0-Jet, 1-Jet, and 2-Jet bins to isolate Gluon-Fusion (ggH) signal regions.
  • Control Regions: Defining dedicated Top and Drell-Yan ($\tau\tau$) control regions to normalize background estimations directly from data.
  • Efficiency Studies: Applying Scale Factors (SF) for Trigger, Lepton ID, and Isolation to correct simulation weights.

Technical Pipeline & Scalability

Instead of traditional event loops, this pipeline uses columnar processing within the Scikit-HEP ecosystem to process millions of events in parallel:

  • Core Stack: Leveraging Uproot for I/O, Awkward Array for jagged data structures, Vector for 4-vector arithmetic, and Hist for multi-dimensional yield accumulation.
  • Distributed Computing: Using Dask to scale the analysis from a local machine to a cluster, reducing processing time from hours to minutes.
  • Statistical Inference: Preparing datacards for the CMS Higgs Combine Tool to perform simultaneous profile likelihood fits and limit setting.

This work is part of my contribution to the HSF-India initiative, aiming to modernize HEP computing and democratize physics research through open science and reproducible data pipelines.