# Data Analysis with Python

T.J. Langford
Wright Lab & YCRC
October 28, 2020

# Overview

• Introduction to numpy and matplotlib
• Data processing and analysis with numpy
• Data visualization with matplotlib

# Tools and Requirements

• Language: Python 3.6
• Modules: numpy, matplotlib
• Jupyter notebook

# Comment: Python 2 versus 3

• Major modules dropped Python 2 support in 2019
• Including numpy, pandas, and matplotlib
• This tutorial uses Python 3
• See https://python3statement.org for details

# Data Processing with numpy

In :
import numpy as np


# What is NumPy?

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

• a powerful N-dimensional array object
• tools for integrating C/C++ and Fortran code
• useful linear algebra, Fourier transform, and random number capabilities

## N-dimensional array objects

• The array object is the fundamental building block of numpy
• 1D array ~ vector
• 2D array ~ matrix
• nD array (n > 2) ~ tensor
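The correspondence above can be checked with the `ndim` attribute; the names `v`, `m`, and `t` below are just illustrative:

```python
import numpy as np

v = np.zeros(3)           # 1D array ~ vector
m = np.zeros((3, 3))      # 2D array ~ matrix
t = np.zeros((2, 3, 4))   # 3D array ~ rank-3 tensor

print(v.ndim, m.ndim, t.ndim)  # → 1 2 3
```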

### Creating arrays

Arrays can be created in a variety of ways. The most common are either empty:

In :
a = np.zeros(10)
a

Out:
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

or you can create them from an existing list:

In :
b = np.array([0,1,2,3,4,5,6,7,8,9])
b

Out:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

### Array properties

Arrays have a few key properties:

• Data type (float, int, etc)
• Number of dimensions
• Shape
• Size
In :
print(a.dtype)
print(b.dtype)

float64
int64

In :
a.shape

Out:
(10,)
In :
c = np.array([[0,1,2,3],[4,5,6,7]])
c

Out:
array([[0, 1, 2, 3],
[4, 5, 6, 7]])
In :
c.shape

Out:
(2, 4)

### Array indexing and slicing

Arrays are indexed (starting at 0) and we can slice them:

In :
b

Out:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In :
b[2:4]

Out:
array([2, 3])
In :
b[0:-2]

Out:
array([0, 1, 2, 3, 4, 5, 6, 7])
In :
b[::2]

Out:
array([0, 2, 4, 6, 8])
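Slices follow the familiar start:stop:step pattern from Python lists; a few more variations, using the same ten-element array as above:

```python
import numpy as np

b = np.arange(10)      # array([0, 1, ..., 9])

print(b[::-1])         # negative step reverses: [9 8 7 6 5 4 3 2 1 0]
print(b[1::2])         # every other element, starting at 1: [1 3 5 7 9]
print(b[-3:])          # last three elements: [7 8 9]
```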

We can also index n-dimensional arrays along multiple axes at once:

In :
c[:,2]

Out:
array([2, 6])

### Array manipulation

Arrays can be manipulated in-place, without creating a second copy:

In :
b[3] = 10
b

Out:
array([ 0,  1,  2, 10,  4,  5,  6,  7,  8,  9])

You can assign to a range of elements at once, either from a single value or from another array:

In :
a[0:2] = 9
a[3:7] = b[0:4]

In :
a

Out:
array([ 9.,  9.,  0.,  0.,  1.,  2., 10.,  0.,  0.,  0.])

Slice assignment copies the values rather than linking the arrays, so changing b now doesn't change a:

In :
b[2] = 10
a

Out:
array([ 9.,  9.,  0.,  0.,  1.,  2., 10.,  0.,  0.,  0.])
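The distinction is that slice *assignment* copies values, while a bare slice *expression* creates a view that shares memory with the original. A minimal sketch:

```python
import numpy as np

src = np.arange(5)
dst = np.zeros(5)

dst[0:3] = src[0:3]   # slice assignment copies the values
view = src[0:3]       # a bare slice is a view sharing src's memory

src[0] = 99
print(dst[0])    # 0.0 -- the copy is unaffected
print(view[0])   # 99  -- the view sees the change
```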

### Array operations

We can also act on these arrays with specific operations:

• add, subtract, multiply, and divide by scalars or other arrays
In :
np.multiply(a, b)

Out:
array([ 0.,  9.,  0.,  0.,  4., 10., 60.,  0.,  0.,  0.])
• extract statistics about the array (minimum, maximum, mean, median, etc)
In :
np.max(b)

Out:
10
In :
np.mean(a)

Out:
3.1
In :
np.median(a)

Out:
0.5
• sum array elements along an axis:
In :
np.sum(c, axis=1)

Out:
array([ 6, 22])
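The axis argument picks the direction of the reduction: axis=0 sums down the columns, axis=1 across the rows, and omitting it sums every element. Using the same c as above:

```python
import numpy as np

c = np.array([[0, 1, 2, 3],
              [4, 5, 6, 7]])

print(np.sum(c, axis=0))   # down the columns: [ 4  6  8 10]
print(np.sum(c, axis=1))   # across the rows:  [ 6 22]
print(np.sum(c))           # every element:    28
```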

### Import and Export Arrays

Numpy has two main ways of importing and exporting data:

• human-readable text files:
In :
np.savetxt('test.txt', c, fmt='%f', delimiter=',', header='My favorite array')

In :
cat test.txt

# My favorite array
0.000000,1.000000,2.000000,3.000000
4.000000,5.000000,6.000000,7.000000

• higher-efficiency binary data:
In :
np.save('test.npy', c)

In :
ls -l test*

-rw-r--r--  1 langford  staff  192 Oct 21 11:22 test.npy
-rw-r--r--  1 langford  staff   92 Oct 21 11:22 test.txt
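Each format has a matching loader: np.loadtxt for the text file (it skips the '#' header line by default) and np.load for the binary file. A round-trip sketch, recreating the files so it stands alone:

```python
import numpy as np

c = np.array([[0, 1, 2, 3], [4, 5, 6, 7]])
np.savetxt('test.txt', c, fmt='%f', delimiter=',', header='My favorite array')
np.save('test.npy', c)

c_txt = np.loadtxt('test.txt', delimiter=',')   # text round-trip (parsed floats)
c_npy = np.load('test.npy')                     # binary round-trip (exact dtype/shape)

print(np.array_equal(c, c_npy))   # → True
```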


## Random Number Generation with numpy

Numpy has a full suite of tools for generating random numbers, which is very helpful for Monte Carlo simulations or generating toy data.

Here we will generate 100k random floats from a normal distribution with mean = 2.0 and sigma = 1.0.

In :
r = np.random.normal(loc=2, scale=1, size=100000)
print(r[0:10])

[1.36524821 1.84659083 1.65146673 1.72326769 2.44126153 3.32639604
0.89162056 1.80902509 4.09851638 0.60734383]


We can randomly select elements from an array:

In :
np.random.choice(a, size=2)

Out:
array([0., 1.])

All the "heavy-lifting" is done in C, so numpy-based work can be very fast.
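For reproducible results, numpy's newer Generator interface (np.random.default_rng, available since numpy 1.17) lets you fix a seed, so the same draws come out on every run; a quick sketch:

```python
import numpy as np

# Fixing the seed makes the random draws reproducible run-to-run
r1 = np.random.default_rng(seed=42).normal(loc=2, scale=1, size=5)
r2 = np.random.default_rng(seed=42).normal(loc=2, scale=1, size=5)

print(np.array_equal(r1, r2))   # same seed → identical draws
```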

### Random Number Example: Monte Carlo pi

We can perform a Monte Carlo simulation to estimate pi: draw uniform random points in the unit square and count the fraction that falls inside the quarter circle of radius 1 (that fraction approaches pi/4).

In :
def mc_pi(num_trials):
    # Draw num_trials random points in the unit square
    x = np.random.uniform(low=0.0, high=1.0, size=num_trials)
    y = np.random.uniform(low=0.0, high=1.0, size=num_trials)

    # Squared distance of each point from the origin
    r = x**2 + y**2

    # Fraction of points inside the quarter circle, times 4, estimates pi
    return len(r[r<1])*4/num_trials


In :
for n in [10,100,1000,10000,100000,1000000]:
    print(f"{n}: {mc_pi(n)}")

10: 3.2
100: 3.16
1000: 3.1
10000: 3.1336
100000: 3.14468
1000000: 3.14098


## Plotting with Matplotlib

In :
import matplotlib.pyplot as plt


• Matplotlib is the backbone of most Python plotting and visualization
• It's super flexible and produces both interactive and publication-ready plots
• We'll touch on a few examples to demonstrate the versatility of this module

## Basic 2D Plot

In :
x = np.array([0,1,2,3,4])
y = np.power(x, 2)
plt.plot(x, y, marker='o', label='Square');
plt.xlabel('X-values');
plt.ylabel('Y-values');
plt.legend(loc=2);
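Outside a notebook, you would save the figure to disk with plt.savefig instead of displaying it inline; a sketch (the Agg backend and the file name square.png are arbitrary choices here):

```python
import matplotlib
matplotlib.use('Agg')   # non-interactive backend; fine for scripts
import matplotlib.pyplot as plt
import numpy as np

x = np.array([0, 1, 2, 3, 4])
plt.plot(x, np.power(x, 2), marker='o', label='Square')
plt.xlabel('X-values')
plt.ylabel('Y-values')
plt.legend(loc=2)
plt.savefig('square.png', dpi=150)   # writes a PNG file
```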


## Scatter plot

In :
x = np.array([0,2,4,6,8,10])
y = np.power(x, 2)
r = np.power(x, 3)
plt.scatter(x, y, s=r)
plt.xlabel('X-values'); plt.ylabel('Y-values');


## Data Processing with Numpy

Now that we have a basic familiarity with numpy, we will attempt to process some low-level data produced by a photomultiplier tube (PMT) connected to a scintillator cell.

A few key details about the data file:

• Recorded by a digitizer (similar to an oscilloscope, records triggered waveforms)
• 2-byte (16-bit) unsigned binary data
• Each trigger window is 300 samples long
• Each trigger has a 6 long-word header containing information about the event:
• Time-stamp, event number, channel ID, etc.
In :
data_file = '../data/180822-EJ309-1700V-Cf252.dat'

In :
f = open(data_file, 'rb')

In :
header = np.fromfile(f, dtype='I', count=4)

trace = np.fromfile(f, dtype='<H', count=300)
plt.plot(trace)
plt.ylabel('ADC')

Channel: 1, Trigger: 0, TimeStamp: 294763

Out:
Text(0, 0.5, 'ADC')

### Analysis Function

In :
def data_reader(file_name, num_ev=10000):
    window_size = 300

    with open(file_name, 'rb') as d:
        # define data structure
        dtype = [('time_stamp','f4'), ('en','f4'), ('psd', 'f4')]
        data = np.zeros(shape=num_ev, dtype=dtype)

        for i in range(num_ev):
            # Read the 6-word event header described above
            # (the time-stamp's position within the header is assumed here)
            header = np.fromfile(d, dtype='I', count=6)
            data['time_stamp'][i] = header[3]

            trace = np.fromfile(d, dtype='<H', count=window_size)

            # Average the first 30 samples to determine what "zero" is
            bl = np.average(trace[0:30])

            # Subtract off the baseline
            trace = np.subtract(trace, bl)

            # Find the minimum of the pulse
            peak_loc = np.argmin(trace)

            # Integrate the full pulse
            data['en'][i] = -np.sum(trace[peak_loc-30:peak_loc+100])

            # Integrate the tail of the pulse, then take the ratio to the full
            data['psd'][i] = -np.sum(trace[peak_loc+10:peak_loc+100])
            data['psd'][i] /= data['en'][i]

    return data

In :
data = data_reader(data_file, num_ev=30000)

<ipython-input-36-a2efc4a279d1>:31: RuntimeWarning: divide by zero encountered in float_scalars
data['psd'][i] /= data['en'][i]
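The warning above comes from events where the integrated pulse en is zero. One way to sidestep it (a sketch, not from the original tutorial, using toy values) is to mask out the zero-energy events before dividing:

```python
import numpy as np

# Toy values standing in for the integrated pulses ('en') and tails
en = np.array([100.0, 0.0, 250.0, 0.0])
tail = np.array([10.0, 0.0, 50.0, 5.0])

psd = np.zeros_like(en)
nonzero = en != 0                           # mask out empty events
psd[nonzero] = tail[nonzero] / en[nonzero]  # divide only where safe

print(psd)   # → [0.1 0.  0.2 0. ]
```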

In :
plt.plot(data['time_stamp'][0:100], marker=',')
plt.xlabel('Event Counter');
plt.ylabel('Time Stamp');

In :
plt.hist(data['en'], bins=200, range=(0,100000), histtype='step');
plt.yscale('log');

In :
from matplotlib.colors import LogNorm

plt.hist2d(data['en'], data['psd'], bins=50, range=((0,50000),(0,0.5)), norm=LogNorm());