
Classifying Higgs Data in R

The Higgs boson was recently discovered. The Nobel Prize in Physics for 2013 was awarded to Francois Englert and Peter Higgs for the theoretical work that predicted the Higgs particle.

The Higgs boson decays through different processes; a particular way in which a particle decays into other particles is called a channel. The Higgs boson has been observed to decay through three channels, all into boson pairs, and a decay into a fermion pair (tau leptons) has also recently been observed. In this article I discuss ways to analyze the data for the decay of the Higgs boson into tau particles and build a neural network classifier to separate the data into signal and background events.

The Detector and the Analysis:-

In the LHC, proton bunches are accelerated along a circular trajectory in both directions. When these bunches cross inside the ATLAS detector, some of the protons collide, producing hundreds of millions of collisions per second. Different levels of online and offline triggers then select what we call events.

The majority of these events represent known processes, collectively called background.

In this work I have tried to build a classifier that separates the data set into background and signal events, signal events being those in which a Higgs particle decays into two tau particles. I used a neural network to train and build the model.

The data came from the LHC open data release and comprised 33 columns and 250000 data points. Of these 33 columns, I first removed the id, the label (background or signal), and the weight.

Because the data is simulated and reconstructed, a weight is provided for each data point to relate the probability of the simulated event to that of actual data from the detectors. Simulated data helps us in many ways: in actual data the proportion of signal to background is very small, and simulating the data helps keep the data set balanced.
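Below is a minimal sketch in R of this preparation step. The file name training.csv and the column names EventId, Weight and Label are assumptions based on the public release of this data set; adjust them to match your copy.

# Load the data and set aside the id, weight and label columns (names assumed)
train <- read.csv("training.csv")                        # 250000 rows, 33 columns

labels   <- ifelse(train$Label == "s", 1, 0)             # signal = 1, background = 0
weights  <- train$Weight                                 # kept aside, not used as a feature
features <- train[, !(names(train) %in% c("EventId", "Weight", "Label"))]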

Some questions I tried to answer during the course of my work:

  • Understanding the statistical aspects of the data:- I went through many books and notes and tried various data visualization techniques in R to explore the relations between the features. I plotted different features of the data and also examined its correlation and covariance matrices.

  • Understanding how the data was generated:- Since the data is not actual detector data but simulated data, I had to understand the relation between the simulated data and the real data. A separate feature (the weight) is provided in the data to capture this relation: the weight is proportional to the conditional density of the simulated event divided by the instrumental density used in the simulator.

  • Data preprocessing:- It was not computationally feasible for the neural net to learn weights over all 30 remaining features for 250000 data points, so I had to select a few features that capture most of the variance in the data while keeping the learning computationally viable. I applied principal component analysis to find the directions along which the data has maximum variance. The first 3 principal components accounted for about 98% of the variance, so I kept those 3 PCs for the rest of the analysis.

  • Training the transformed data:- After finding the principal components and thereby reducing the dimensionality from 30 to 3, my aim was to train on this new transformed data. For that I used the neuralnet package in R. I trained the network with 2 hidden layers and got satisfactory results on testing, classifying about 80% of the data points correctly into background and signal. A short code sketch of these two steps follows this list.
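The sketch below strings the preprocessing and training steps together, reusing the features and labels objects built earlier. The hidden-layer sizes are an assumption; the article only states that two hidden layers were used.

library(neuralnet)

# Principal component analysis on the 30 remaining features
pca <- prcomp(features, center = TRUE, scale. = TRUE)
summary(pca)                                  # cumulative proportion of variance per PC

train_pc <- as.data.frame(pca$x[, 1:3])       # keep the first 3 principal components
train_pc$label <- labels

# Hold out 20% of the events for testing
set.seed(42)
idx <- sample(nrow(train_pc), 0.8 * nrow(train_pc))
tr  <- train_pc[idx, ]
te  <- train_pc[-idx, ]

# Two hidden layers; the sizes c(5, 3) are only an illustrative choice
nn <- neuralnet(label ~ PC1 + PC2 + PC3, data = tr,
                hidden = c(5, 3), linear.output = FALSE)

pred <- compute(nn, te[, c("PC1", "PC2", "PC3")])$net.result
mean((pred > 0.5) == te$label)                # fraction classified correctly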

Some concepts that I used in my code for building the classifier:-

1) Principal Component Analysis:- PCA is essentially a mathematical way of finding new directions in the data along which it has maximum variance. The advantage is that a small number of principal components can then be used to classify the data efficiently. The disadvantage is that the principal components change whenever the initial training data changes, so PCA cannot easily be used in real-time processes.

The idea can be understood with a simple picture: a horizontal line is a principal component of a set of data points if the projections of the points onto the horizontal line are more spread out than their projections onto the vertical line. The data then has more variance along the horizontal direction, and this is the basic idea behind PCA.
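A toy numeric version of that picture, with points spread widely along the horizontal axis and only slightly along the vertical one, shows the first principal component lining up with the horizontal direction; the numbers here are made up purely for illustration.

set.seed(1)
x <- rnorm(100, sd = 5)        # large spread horizontally
y <- rnorm(100, sd = 0.5)      # small spread vertically
toy <- cbind(x, y)

pc <- prcomp(toy)
pc$rotation                    # first PC points (almost) along the horizontal axis
summary(pc)                    # PC1 explains ~99% of the variance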

2) Neural networks:- Neural networks are one of the machine learning techniques widely used these days to extract hidden patterns from data. The idea behind them is quite simple and closely mirrors the dendrite and axon connections in the human brain.

The basic idea is to find the weights corresponding to all the features, or combinations of features, that minimize a cost function. Different algorithms (e.g. gradient descent) can be used to minimize that cost function. At the end of the day, every learning algorithm is just a somewhat complicated optimization problem in which many parameters are adjusted at once to optimize an objective.
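As an illustration of that idea, here is a bare-bones gradient descent in R on a one-parameter squared-error cost; the data and learning rate are invented purely for the example.

x <- c(1, 2, 3, 4)
y <- c(2.1, 3.9, 6.2, 7.8)     # roughly y = 2 * x

w    <- 0                      # initial weight
rate <- 0.01                   # learning rate

for (step in 1:200) {
  grad <- mean(2 * (w * x - y) * x)   # derivative of mean squared error w.r.t. w
  w <- w - rate * grad                # step against the gradient
}
w                              # converges to roughly 2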

3) Cross-validation:- There is a famous saying in the data analytics world: "Trust your CV result, not your test result." So what is so special about cross-validation that people trust it more? It is essentially a generalization test: it measures how well our model generalizes to new data. Some data-specific precautions should be taken when applying different kinds of CV. CV also helps with model selection: by comparing CV results we can easily decide which model or learning method to apply. In CV we repeatedly divide the training set into two parts, a bigger portion (80-90%) for training and the rest for validation, so that the whole training set is used once for validation. The predictions are compared with the labels we have, which gives an idea of how well the model will perform on the test data set.
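A minimal k-fold version of this procedure, reusing the 3-PC data frame train_pc from the earlier sketch, might look as follows; 5 folds give the 80/20 split mentioned above, and the hidden-layer sizes are again only illustrative.

library(neuralnet)

k     <- 5
folds <- sample(rep(1:k, length.out = nrow(train_pc)))
acc   <- numeric(k)

for (i in 1:k) {
  tr <- train_pc[folds != i, ]               # train on k - 1 folds
  te <- train_pc[folds == i, ]               # validate on the held-out fold

  nn   <- neuralnet(label ~ PC1 + PC2 + PC3, data = tr,
                    hidden = c(5, 3), linear.output = FALSE)
  pred <- compute(nn, te[, c("PC1", "PC2", "PC3")])$net.result
  acc[i] <- mean((pred > 0.5) == te$label)
}

mean(acc)                                    # average accuracy across folds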

Although the data visualization and preprocessing part can be tedious, it is the most important part of any data analysis project; the visualization helps you form a strategy for analyzing the data.

For my code, data, or any query:- sajagsasa@gmail.com

Please do point out my mistakes and give suggestions.


