Introduction to Data Analysis with Python¶

T.J. Langford
Vincent Balbarin
Yale Center for Research Computing
October 25, 2022

Overview¶

  • Introduction to Jupyter Docker Stacks
  • Introduction to podman
  • Introduction to numpy and matplotlib
  • Data processing and analysis with numpy
  • Data visualization with matplotlib
  • Real-world example of data processing

Tutorial Requirements¶

  • Repository for this tutorial: https://github.com/WrightLaboratory/data_analysis
    • Notebook
    • Test data
    • Supporting scripts
  • jupyter/scipy-notebook Docker image
  • Podman

Jupyter and Docker¶

  • The Jupyter project provides pre-built Docker images at Jupyter Docker Stacks
  • No need to install and configure Python virtual environments
  • The jupyter/scipy-notebook image includes popular packages from the scientific Python ecosystem

Podman¶

  • Container engine for executing Open Container Initiative (OCI) compliant images
  • The podman CLI supports many of the same flags as the docker CLI
  • Available from Linux package repositories; on Windows, podman machine runs inside a WSL VM, and on macOS it runs inside a QEMU VM

Podman Security¶

  • Daemon-less and capable of running rootless
  • Integrates with systemd
  • Kubernetes-like
    • Related application components are grouped into pods
    • Namespaces provide another layer of isolation

Running This Notebook¶

Follow the instructions to install Podman on your operating system of choice.

# Clone the repository
git clone https://github.com/WrightLaboratory/data_analysis data_analysis
# Use the provided script to launch Jupyter container
cd data_analysis
bash ./scripts/podman_jupyterlab.sh
# Find the URL in the standard output and open it in your browser
...
[C 2022-10-18 03:30:26.665 ServerApp] 

  To access the server, open this file in a browser:
      file:///home/jovyan/.local/share/jupyter/runtime/jpserver-2-open.html
Or copy and paste one of these URLs:
      http://784772c24b81:8888/lab?token=929d6fbdc29185ba5ac86b60c689e84b1f3fa2eb22b46ebc
or http://127.0.0.1:8888/lab?token=929d6fbdc29185ba5ac86b60c689e84b1f3fa2eb22b46ebc
...

Data Processing with numpy¶

In [2]:
import numpy as np

What is Numpy?¶

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

  • a powerful N-dimensional array object
  • sophisticated (broadcasting) functions
  • tools for integrating C/C++ and Fortran code
  • useful linear algebra, Fourier transform, and random number capabilities

User Guide

N-dimensional array objects¶

  • The array object is the fundamental building block of numpy
  • 1D array ~ vector
  • 2D array ~ matrix
  • nD array (n > 2) ~ tensor, as illustrated below
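
As a minimal illustration (not part of the original notebook), the ndim attribute of an array reflects this distinction:

import numpy as np

v = np.array([1.0, 2.0, 3.0])        # 1D array, ndim == 1 (vector-like)
m = np.array([[1, 2], [3, 4]])       # 2D array, ndim == 2 (matrix-like)
t = np.zeros((2, 3, 4))              # 3D array, ndim == 3 (tensor-like)

print(v.ndim, m.ndim, t.ndim)        # 1 2 3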

Creating arrays¶

Arrays can be created in a variety of ways. The most common are either filled with zeros:

In [3]:
a = np.zeros(10)
a
Out[3]:
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

or you can create them from an existing list:

In [4]:
b = np.array([0,1,2,3,4,5,6,7,8,9])
b
Out[4]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Array properties¶

Arrays have a few key properties:

  • Data type (float, int, etc)
  • Number of dimensions
  • Shape and Size
In [5]:
print(a.dtype)
print(b.dtype)
float64
int64
In [6]:
a.shape
Out[6]:
(10,)
In [7]:
c = np.array([[0,1,2,3],[4,5,6,7]])
c
Out[7]:
array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

Array operations¶

We can also act on these arrays with specific operations:

  • add, subtract, multiply, and divide by scalars or other arrays
In [8]:
a = np.multiply(b, 4)
a
Out[8]:
array([ 0,  4,  8, 12, 16, 20, 24, 28, 32, 36])
In [9]:
np.multiply(a, b)
Out[9]:
array([  0,   4,  16,  36,  64, 100, 144, 196, 256, 324])
  • extract statistics about the array (minimum, maximum, RMS, etc)
In [10]:
np.max(b)
Out[10]:
9
In [11]:
np.mean(a)
Out[11]:
18.0
In [12]:
np.std(a)
Out[12]:
11.489125293076057
  • sum array elements along an axis:
In [13]:
np.sum(c, axis=0)
Out[13]:
array([ 4,  6,  8, 10])

Filter based on values¶

Array elements can be selected based on their values (or on the values of other arrays):

In [14]:
print(a)
[ 0  4  8 12 16 20 24 28 32 36]
In [15]:
a > 5
Out[15]:
array([False, False,  True,  True,  True,  True,  True,  True,  True,
        True])

This array of True/False values can then be used to filter a itself or other arrays:

In [16]:
a[a > 5]
Out[16]:
array([ 8, 12, 16, 20, 24, 28, 32, 36])
In [17]:
b[a > 5]
Out[17]:
array([2, 3, 4, 5, 6, 7, 8, 9])

We will use this to filter some real data later in the tutorial.

Import and Export Arrays¶

Numpy has two main ways of exporting and importing data:

  • Saving a human-readable text file with np.savetxt:
In [18]:
np.savetxt('test.txt', c, fmt='%f', delimiter=',', header='My favorite array')
In [19]:
cat test.txt
# My favorite array
0.000000,1.000000,2.000000,3.000000
4.000000,5.000000,6.000000,7.000000
  • Saving faster, more efficient numpy binary data (test.npy) with np.save:
In [20]:
np.save('test.npy', c)
  • Reading text or binary data with np.loadtxt or np.fromfile
In [21]:
np.loadtxt('test.txt',delimiter=',')
Out[21]:
array([[0., 1., 2., 3.],
       [4., 5., 6., 7.]])
  • Numpy and pandas also provide a variety of tools to read specific data formats (csv, binary, etc.), as sketched below
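
As a small, hedged example (reusing the test.txt file written above), the same CSV could be read with either library; the options shown are one reasonable choice, not the only one:

import numpy as np
import pandas as pd

# genfromtxt skips '#' comment lines by default and returns an ndarray
arr = np.genfromtxt('test.txt', delimiter=',')

# read_csv returns a labeled DataFrame; skip the '# My favorite array' header line
df = pd.read_csv('test.txt', comment='#', header=None)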

Random Number Generation with numpy¶

Numpy has a full suite of tools for generating random numbers, which is very helpful for Monte Carlo simulations or for generating toy data.

Here we will generate 100k random floats from a normal distribution with mean = 2.0 and sigma = 1.0.

In [22]:
r = np.random.normal(loc=2, scale=1, size=100000)
print(r[0:10])
[4.59294897 3.15374283 3.04135266 1.28198312 2.19810513 1.51165547
 1.75905533 2.94476109 1.28497692 1.17879256]

We can randomly select elements from an array:

In [23]:
np.random.choice(a, size=2)
Out[23]:
array([24, 32])

All the "heavy-lifting" is done in C, so numpy-based work can be very fast.
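
As a rough, hedged illustration of that speed difference (timings will vary by machine), compare a pure-Python loop with the vectorized np.sum:

import time
import numpy as np

x = np.random.normal(size=1_000_000)

t0 = time.perf_counter()
total_loop = sum(float(v) for v in x)   # element-by-element Python loop
t1 = time.perf_counter()
total_vec = np.sum(x)                   # vectorized sum implemented in C
t2 = time.perf_counter()

print(f"python loop: {t1 - t0:.4f} s, np.sum: {t2 - t1:.6f} s")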

Random Number Example: Monte Carlo $\pi$¶

We can perform a Monte Carlo-based simulation to calculate $\pi$ using two uniform random number generators.
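
The reasoning: points drawn uniformly in the unit square fall inside the quarter circle of radius 1 with probability equal to its area, so the hit fraction approximates $\pi/4$:

$$\frac{N_{\text{hits}}}{N_{\text{trials}}} \approx \frac{\pi/4}{1} \quad\Longrightarrow\quad \pi \approx 4\,\frac{N_{\text{hits}}}{N_{\text{trials}}}$$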

In [24]:
def mc_pi(num_trials):
    x = np.random.uniform(low=0.0, high=1.0, size=num_trials)
    y = np.random.uniform(low=0.0, high=1.0, size=num_trials)

    r = x**2 + y**2
    
    return len(r[r<1])*4/num_trials
    
In [25]:
for n in [10,100,1000,10000,100000,1000000,10000000]:
    print(f"{n}: {mc_pi(n)}")
10: 3.2
100: 3.16
1000: 3.028
10000: 3.1464
100000: 3.13692
1000000: 3.142888
10000000: 3.1423868

Plotting with Matplotlib¶

In [26]:
import matplotlib.pyplot as plt

Plotting with Matplotlib¶

  • Matplotlib is the backbone of most Python plotting and visualization
  • It is highly flexible and produces both interactive and publication-ready plots
  • We'll touch on a few examples to demonstrate the versatility of this module

Basic 2D Plot¶

In [27]:
x = np.array([0,1,2,3,4,5])
y = np.power(x, 2)
plt.plot(x, y, marker='o', color='k', label='Square');
plt.xlabel('X-values');
plt.ylabel('Y-values');
plt.legend(loc=2);

Scatter plot¶

In [28]:
x = np.array([0,2,4,6,8,10])
y = np.power(x, 2)
r = np.power(x, 3)
plt.scatter(x, y, s=r)
plt.xlabel('X-values'); plt.ylabel('Y-values');

Visualizing Monte Carlo Pi¶

Using our previously discussed Monte Carlo method for determining $\pi$, we can visualize this procedure in a few different ways.

First, we can make a scatter plot of one determination of $\pi$ to show how the algorithm works. X and Y values are drawn from a uniform random distribution, and any points that fall inside the shaded region (where $x^2 + y^2 \le 1$) are counted.

In [29]:
plt.figure(figsize=(5,5))
ax = plt.gca()
circle = plt.Circle((0,0), radius=1, color='C4', alpha=0.3)

num_trials = 1000
x = np.random.uniform(low=0.0, high=1.0, size=num_trials)
y = np.random.uniform(low=0.0, high=1.0, size=num_trials)

plt.scatter(x,y,marker='.')
ax.add_patch(circle)
plt.ylim(0,1)
plt.xlim(0,1)
Out[29]:
(0.0, 1.0)

Next, we can plot the evolution of our MC-generated value of $\pi$ as we increase the number of samples:

In [30]:
mc_pi_samples = np.logspace(start=1, stop=6, num=100, dtype='int')
mc_pi_values = [mc_pi(x) for x in mc_pi_samples]

plt.plot(mc_pi_samples, mc_pi_values, marker='o', linestyle='None', label='Monte Carlo')
plt.ylim(np.pi-1,np.pi+1)
plt.hlines(np.pi, xmin=0,xmax=np.max(mc_pi_samples), color='k', linestyle='--', label='Pi')
plt.xscale('log');
plt.legend()
plt.xlabel('MC Samples');
plt.ylabel('Simulated Value of Pi')
Out[30]:
Text(0, 0.5, 'Simulated Value of Pi')

Histograms¶

Combining numpy and matplotlib, we can begin to generate some of the key plots used to display data. One of the most common types of data visualization is the histogram.

To get started, let's generate some random data using numpy.random:

In [31]:
a = np.random.uniform(low=0, high=10, size=100000)
b = np.random.exponential(scale=1, size=100000)

Now we can create a histogram (tabulation of values into discrete bins)

In [32]:
h, bins = np.histogram(a, bins=20, range=(0,10))

and plot the distributions using matplotlib:

In [33]:
plt.stairs(h, bins, label='Data')
plt.hlines(h.mean(), xmin=-0.25,xmax=10.25,linestyles='--', color='k', alpha=0.5, label='Average')
plt.xlabel('Value'); plt.ylabel('Counts/bin')
plt.legend(loc=3)
Out[33]:
<matplotlib.legend.Legend at 0xffff759df0d0>

We can do the same to plot our exponential distribution of data, but this time we can set the y-scale to be log to better highlight the distribution:

In [34]:
h, bins = np.histogram(b, bins=20, range=(0,10))
plt.stairs(h, bins, label='Data')
plt.xlabel('Value'); plt.ylabel('Counts/bin'); plt.yscale('log')
plt.legend(loc=0)
Out[34]:
<matplotlib.legend.Legend at 0xffff77bba560>

Data Processing with Numpy¶

Now that we have a basic familiarity with numpy, we will process some low-level data produced by a photomultiplier tube (PMT) attached to a scintillator cell.

A few key details about the data file:

  • Recorded by a digitizer (similar to an oscilloscope, records triggered waveforms)
  • Samples are stored as 2-byte (16-bit) unsigned integers
  • Each trigger window is 300 samples long
  • Each trigger has a 4-long-word header containing information about the event: Trigger Event Size + Check Bit, Channel, Trigger, Timestamp (a byte-offset sketch follows below)
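
Given this layout, each event record occupies a fixed number of bytes, so the byte offset of any event can be computed directly. A minimal sketch under the sizes assumed above (the event_offset helper is not part of the original code):

HEADER_SIZE = 4      # header long-words per event (4 bytes each)
WINDOW_SIZE = 300    # 2-byte samples per event

EVENT_BYTES = 4 * HEADER_SIZE + 2 * WINDOW_SIZE   # 616 bytes per event record

def event_offset(i):
    """Byte offset of the i-th event record from the start of the file."""
    return i * EVENT_BYTES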

Set the path to the data file and specify the trigger event properties.

In [35]:
from pathlib import Path

data_file = '180822-EJ309-1700V-Cf252.dat'

HEADER_SIZE = 4
WINDOW_SIZE = 300

data_file_path = Path.cwd().parent.joinpath('data', data_file)

Read the header from the beginning of the file and the first event trace.

In [36]:
header = np.fromfile(data_file_path, dtype=np.uint32,
                     count=HEADER_SIZE)

# Since each header item is 1 long-word (4 bytes),
# the data for the 1st event begins HEADER_SIZE*4 bytes from the beginning of the file
trace = np.fromfile(data_file_path, dtype=np.uint16,
                    count=WINDOW_SIZE, offset=HEADER_SIZE*4)

Plot the first trigger event trace.

In [37]:
plt.title(f"Channel: {header[1]}, Trigger: {header[2]}, TimeStamp: {header[3]}")
plt.plot(trace)
plt.xlabel('Samples');plt.ylabel('ADC');

Remove baseline counts

In [38]:
plt.title(f"Channel: {header[1]}, Trigger: {header[2]}, TimeStamp: {header[3]}")
plt.plot(trace)
plt.axvspan(xmin=0,xmax=50, color='C2',alpha=0.2)
plt.axvspan(xmin=50,xmax=300, color='C3',alpha=0.2)
plt.xlabel('Samples'); plt.ylabel('ADC');

Calculate the baseline from the average of the first 50 samples.

In [39]:
baseline = np.average(trace[0:50])
print(f"baseline: {baseline:0.2f}ADC")
baseline: 14614.12ADC

Subtract the baseline value from all the elements of the trace array.

In [40]:
trace = np.subtract(trace, baseline)
In [41]:
plt.title(f"Channel: {header[1]}, Trigger: {header[2]}, TimeStamp: {header[3]}")
plt.plot(trace)
plt.axvspan(xmin=0,xmax=50, color='C2',alpha=0.2)
plt.axvspan(xmin=50,xmax=300, color='C3',alpha=0.2)
plt.xlabel('Samples');plt.ylabel('ADC');

Calculate the signal integral for the trigger event. Since the baseline-subtracted pulse is negative-going, we negate the sum to obtain a positive integral.

In [42]:
sig_int = -np.sum(trace)
print(f"signal integral: {sig_int:.2f}ADC counts")
signal integral: 913.00ADC counts

Analysis Function¶

In [43]:
def data_reader(file_path, header_size=4, window_size=300, num_ev=10000):

    with file_path.open(mode='rb') as d:  # open the raw digitizer data in binary mode
        # define data structure
        dtype = [('time_stamp','f4'), ('en','f4'), ('psd', 'f4')]
        data = np.zeros(shape=num_ev, dtype=dtype)
        
        for i in range(num_ev):
            header = np.fromfile(d, dtype=np.uint32, count=header_size)            
            data['time_stamp'][i] = header[3]
            trace = np.fromfile(d, dtype=np.uint16, count=window_size)
            
            # Average the first 30 samples to determine what "zero" is
            bl = np.average(trace[0:30])

            # Subtract off the baseline 
            trace = np.subtract(trace, bl)

            # find the minimum of the pulse
            peak_loc = np.argmin(trace)
            
            # Integrate the full pulse
            data['en'][i] = -np.sum(trace[peak_loc-30:peak_loc+100])
            
            # Integrate the tail of the pulse, then take the ratio to the full
            data['psd'][i] = -np.sum(trace[peak_loc+10:peak_loc+100])
            data['psd'][i] /= data['en'][i]
            
    return data
In [44]:
data = data_reader(data_file_path,
                   header_size=HEADER_SIZE,
                   window_size=WINDOW_SIZE,
                   num_ev=50000)
/tmp/ipykernel_7262/2814272060.py:27: RuntimeWarning: divide by zero encountered in float_scalars
  data['psd'][i] /= data['en'][i]
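
The divide-by-zero warning comes from events whose full-pulse integral (data['en']) is zero, which makes the tail-fraction undefined. A minimal, hypothetical guard (not part of the original data_reader) would be to drop those events, for example:

# hypothetical cleanup step: keep only events with a nonzero pulse integral
data_clean = data[data['en'] != 0]

The original analysis below simply proceeds with the unfiltered array.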
In [46]:
plt.plot(data['time_stamp'][0:1000], marker=',')
plt.xlabel('Event Counter');
plt.ylabel('Time Stamp');
In [47]:
h, bins = np.histogram(data['en'], bins=200, range=(0,100000))

plt.stairs(h, bins)
plt.yscale('log');
plt.xlabel('Pulse Integral (ADC)');
plt.ylabel('Counts/bin');
In [48]:
from matplotlib.colors import LogNorm  # needed for the logarithmic color scale

plt.hist2d(data['en'], data['psd'], bins=50, range=((0,50000),(0,0.5)), norm=LogNorm());
plt.xlabel('Pulse Integral (ADC)');
plt.ylabel('Tail-fraction (%)');
In [49]:
filter = data['en']>15000
h, bins = np.histogram(data['psd'][filter], bins=200, range=(0,0.5))
plt.stairs(h, bins, label='high-energy');

filter = data['en']<15000
h, bins = np.histogram(data['psd'][filter], bins=200, range=(0,0.5))
plt.stairs(h, bins, label='low-energy');

plt.xlabel('Tail-fraction (%)');plt.ylabel('Counts/bin'); plt.legend();

Aside: Data Preparation with Pandas¶

The pandas library builds on top of numpy. It excels at processing tabular data of heterogeneous types.

This example deals with an Excel worksheet used to track project activities.

  • It uses multiple rows to denote column headings.
  • It comprises columns of data type string and float.
  • The data must be prepared so that it maps cleanly to the fields of a relational database.

Load the necessary modules

In [50]:
import numpy as np
import pandas as pd
from pathlib import Path

Load the Excel worksheet into a data frame

In [51]:
data_file = '02-VA BMTv3 20220225.xlsx'
worksheet_name = 'OUTPUT'

data_file_path = Path.cwd().parent.joinpath('data', data_file)
df = pd.read_excel(data_file_path, sheet_name=worksheet_name, header=[0, 1, 2])
In [52]:
df
Out[52]:
INDEX VALUE Priority Team Project Mgr. Building Code Building ID Building Description Cluster # # closets Legacy Switch Count ... Building Transitions Building Transitions
Unnamed: 0_level_1 Unnamed: 1_level_1 Unnamed: 2_level_1 Unnamed: 3_level_1 Unnamed: 4_level_1 Unnamed: 5_level_1 Unnamed: 6_level_1 Unnamed: 7_level_1 Unnamed: 8_level_1 Unnamed: 9_level_1 ... Healthcare_VN Voice_VN PCI_VN Building Communication Wired Full Building Wired Network Services Cutover to NGN Building Comms Wireless Wireless Cutover to NGN Legacy Cable & Equipment Removal Building Comms Completion Lessons Learned
Unnamed: 0_level_2 Unnamed: 1_level_2 Unnamed: 2_level_2 Unnamed: 3_level_2 Unnamed: 4_level_2 Unnamed: 5_level_2 Unnamed: 6_level_2 Unnamed: 7_level_2 Unnamed: 8_level_2 Unnamed: 9_level_2 ... 33 34 35 36 37 38 39 40 41 42
0 1.0 1.0 NaN NaN VAMC-01 4160 VA MEDICAL CTR BLDG 01 C02-02 MDF 9 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 2.0 2.0 NaN NaN VAMC-02 4161 VA MEDICAL CTR BLDG 02 C02-02 4 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 3.0 3.0 NaN NaN VAMC-03 41?? VA MEDICAL CTR BLDG 03 C02-01 1 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 4.0 4.0 NaN NaN VAMC-04 4162 VA MEDICAL CTR BLDG 04 C02-01 1 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 5.0 5.0 NaN NaN VAMC-05 4164 VA MEDICAL CTR BLDG 05 C02-01 MDF 4 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 6.0 6.0 NaN NaN VAMC-08 41?? VA MEDICAL CTR BLDG 08 C02-01 1 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
6 7.0 7.0 NaN NaN VAMC-08.5 41?? VA MEDICAL CTR BLDG 08.5 C02-01 1 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7 8.0 8.0 NaN NaN VAMC-12A 41?? VA MEDICAL CTR BLDG 12A C02-01 1 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 9.0 9.0 NaN NaN VAMC-34 4020 VA MEDICAL CTR BLDG 34 C02-01 1 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 10.0 10.0 NaN NaN VAMC-35 41?? VA MEDICAL CTR BLDG 35 C02-01 2 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10 11.0 11.0 NaN NaN VAMC-35A 4165 VA MEDICAL CTR BLDG 35A C02-01 2 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
11 12.0 12.0 NaN NaN VAMC-36 41?? VA MEDICAL CTR BLDG 36 C02-01 1 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
12 NaN NaN NaN NaN NaN NaN NaN NaN 28 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
13 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
14 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
15 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
16 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
17 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
18 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
19 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 12.0 NaN NaN NaN NaN NaN
20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN
21 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN
22 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN
23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN
24 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN
25 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN
26 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN
27 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
28 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 12.0 NaN NaN NaN NaN NaN

29 rows × 63 columns

Collapse the column names by iterating through the list of the dataframe's columns and creating a new list of names: where the level-2 name is "Unnamed", fall back to the level-1 name; otherwise keep the level-2 name.

In [53]:
flattened_column_names = [
    (l1 if ('Unnamed:' in l2) else l2 ) for l1, l2, l3 in df.columns]
In [54]:
flattened_column_names
Out[54]:
['INDEX VALUE',
 'Priority',
 'Team',
 'Project Mgr.',
 'Building Code',
 'Building ID',
 'Building Description',
 'Cluster #',
 '# closets',
 'Legacy Switch Count',
 'NGN Switch Count',
 "AP's",
 'Assigned General Building VN',
 'Cutover Type',
 'Known special use networks in building',
 'Fabric Site Designation',
 'Wired Migration Date\n(tentative)',
 'Wireless Migration Date',
 'STAGE 1',
 'STAGE 2/3',
 'Building',
 'Program and Building Communication',
 'Switch Discovery',
 'Extension Discovery & Design',
 'Inter-Fiber Discovery & Deployment',
 'Building Comms Closet Communication',
 'Closet Surveys',
 'Bldg/Clstr -InterConnected Design',
 'IP Discovery',
 'Speciality VN Device Discovery',
 'Firewall Review',
 'WAP Predictive Review',
 'Business Questionnaire',
 'Building Comms DDNS Awareness',
 'DDNS',
 'IP, AV, & Printer Remediation',
 'Walkthrough & Intrabuilding Architecture Design',
 'BOM Process',
 'Closet Remediation',
 'Switch Hardware Staging',
 'Logistics',
 'Switch Rack & Stack',
 'LAN Automation (Fabric Connection)',
 'Building Communications',
 'WAP Prep',
 'QA Testing',
 'Printers to NGN',
 'Building Comms - User Specific App Testing Awareness - Building Ops Awareness',
 'Specialty Devices to NGN',
 'User Specific Testing',
 'Safety_VN',
 'BLDGSYS_VN',
 'ScienceNet to NGN',
 'Healthcare_VN',
 'Voice_VN',
 'PCI_VN',
 'Building Communication Wired ',
 'Full Building Wired Network Services Cutover to NGN',
 'Building Comms Wireless ',
 'Wireless Cutover to NGN',
 'Legacy Cable & Equipment Removal',
 'Building Comms Completion',
 'Lessons Learned']

Replace the data frame's columns with flattened_column_names

In [55]:
df.columns = flattened_column_names
In [56]:
df
Out[56]:
INDEX VALUE Priority Team Project Mgr. Building Code Building ID Building Description Cluster # # closets Legacy Switch Count ... Healthcare_VN Voice_VN PCI_VN Building Communication Wired Full Building Wired Network Services Cutover to NGN Building Comms Wireless Wireless Cutover to NGN Legacy Cable & Equipment Removal Building Comms Completion Lessons Learned
0 1.0 1.0 NaN NaN VAMC-01 4160 VA MEDICAL CTR BLDG 01 C02-02 MDF 9 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 2.0 2.0 NaN NaN VAMC-02 4161 VA MEDICAL CTR BLDG 02 C02-02 4 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 3.0 3.0 NaN NaN VAMC-03 41?? VA MEDICAL CTR BLDG 03 C02-01 1 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 4.0 4.0 NaN NaN VAMC-04 4162 VA MEDICAL CTR BLDG 04 C02-01 1 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 5.0 5.0 NaN NaN VAMC-05 4164 VA MEDICAL CTR BLDG 05 C02-01 MDF 4 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 6.0 6.0 NaN NaN VAMC-08 41?? VA MEDICAL CTR BLDG 08 C02-01 1 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
6 7.0 7.0 NaN NaN VAMC-08.5 41?? VA MEDICAL CTR BLDG 08.5 C02-01 1 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7 8.0 8.0 NaN NaN VAMC-12A 41?? VA MEDICAL CTR BLDG 12A C02-01 1 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 9.0 9.0 NaN NaN VAMC-34 4020 VA MEDICAL CTR BLDG 34 C02-01 1 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 10.0 10.0 NaN NaN VAMC-35 41?? VA MEDICAL CTR BLDG 35 C02-01 2 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10 11.0 11.0 NaN NaN VAMC-35A 4165 VA MEDICAL CTR BLDG 35A C02-01 2 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
11 12.0 12.0 NaN NaN VAMC-36 41?? VA MEDICAL CTR BLDG 36 C02-01 1 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
12 NaN NaN NaN NaN NaN NaN NaN NaN 28 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
13 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
14 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
15 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
16 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
17 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
18 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
19 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 12.0 NaN NaN NaN NaN NaN
20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN
21 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN
22 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN
23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN
24 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN
25 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN
26 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN
27 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
28 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 12.0 NaN NaN NaN NaN NaN

29 rows × 63 columns

The column names should be converted to the names of the fields used by the relational database.

In [57]:
sql_table_fields = [
    'index',
    'priority',
    'team',
    'project_manager',
    'building_code',
    'building_id',
    'building_description',
    'cluster_number',
    'number_closets',
    'legacy_switch_count',
    'ngn_switch_count',
    'aps',
    'assigned_building_vn',
    'cutover_type',
    'known_special_networks',
    'fabric_site_designation',
    'wired_migration_date',
    'wireless_migration_date',
    'stage_1_complete',
    'stage_2_3_complete',
    'building_complete',
    'program_building_communication',
    'switch_discovery',
    'extension_discovery',
    'inter_fiber',
    'building_coms_closet',
    'closet_surveys',
    'bldg_clstr_design',
    'ip_discovery',
    'speciality_vn',
    'firewall_review',
    'wap_predictive_review',
    'business_questionnaire',
    'building_comms_ddns',
    'ddns',
    'ip_av_printer_remediation',
    'walkthrough_intrabuilding',
    'bom_process',
    'closet_remediation',
    'switch_hardware_staging',
    'logistics',
    'switch_rack_stack',
    'lan_automation',
    'building_comms',
    'wap_prep',
    'qa_testing',
    'printers_ngn',
    'building_comms_app_testing',
    'specialty_devices',
    'user_specific_testing',
    'safety_vn',
    'bldgsys_vn',
    'sciencenet_ngn',
    'healthcare_vn',
    'voice_vn',
    'pci_vn',
    'building_comms_wired',
    'full_building_wired_cutover_ngn',
    'building_comms_wireless',
    'wireless_cutover_ngn',
    'legacy_cable_removal',
    'building_comms_completion',
    'lessons_learned']

Construct a lookup dictionary of flattened_column_names to sql_table_fields

In [58]:
rename_map = dict(zip(flattened_column_names, sql_table_fields))
In [59]:
rename_map
Out[59]:
{'INDEX VALUE': 'index',
 'Priority': 'priority',
 'Team': 'team',
 'Project Mgr.': 'project_manager',
 'Building Code': 'building_code',
 'Building ID': 'building_id',
 'Building Description': 'building_description',
 'Cluster #': 'cluster_number',
 '# closets': 'number_closets',
 'Legacy Switch Count': 'legacy_switch_count',
 'NGN Switch Count': 'ngn_switch_count',
 "AP's": 'aps',
 'Assigned General Building VN': 'assigned_building_vn',
 'Cutover Type': 'cutover_type',
 'Known special use networks in building': 'known_special_networks',
 'Fabric Site Designation': 'fabric_site_designation',
 'Wired Migration Date\n(tentative)': 'wired_migration_date',
 'Wireless Migration Date': 'wireless_migration_date',
 'STAGE 1': 'stage_1_complete',
 'STAGE 2/3': 'stage_2_3_complete',
 'Building': 'building_complete',
 'Program and Building Communication': 'program_building_communication',
 'Switch Discovery': 'switch_discovery',
 'Extension Discovery & Design': 'extension_discovery',
 'Inter-Fiber Discovery & Deployment': 'inter_fiber',
 'Building Comms Closet Communication': 'building_coms_closet',
 'Closet Surveys': 'closet_surveys',
 'Bldg/Clstr -InterConnected Design': 'bldg_clstr_design',
 'IP Discovery': 'ip_discovery',
 'Speciality VN Device Discovery': 'speciality_vn',
 'Firewall Review': 'firewall_review',
 'WAP Predictive Review': 'wap_predictive_review',
 'Business Questionnaire': 'business_questionnaire',
 'Building Comms DDNS Awareness': 'building_comms_ddns',
 'DDNS': 'ddns',
 'IP, AV, & Printer Remediation': 'ip_av_printer_remediation',
 'Walkthrough & Intrabuilding Architecture Design': 'walkthrough_intrabuilding',
 'BOM Process': 'bom_process',
 'Closet Remediation': 'closet_remediation',
 'Switch Hardware Staging': 'switch_hardware_staging',
 'Logistics': 'logistics',
 'Switch Rack & Stack': 'switch_rack_stack',
 'LAN Automation (Fabric Connection)': 'lan_automation',
 'Building Communications': 'building_comms',
 'WAP Prep': 'wap_prep',
 'QA Testing': 'qa_testing',
 'Printers to NGN': 'printers_ngn',
 'Building Comms - User Specific App Testing Awareness - Building Ops Awareness': 'building_comms_app_testing',
 'Specialty Devices to NGN': 'specialty_devices',
 'User Specific Testing': 'user_specific_testing',
 'Safety_VN': 'safety_vn',
 'BLDGSYS_VN': 'bldgsys_vn',
 'ScienceNet to NGN': 'sciencenet_ngn',
 'Healthcare_VN': 'healthcare_vn',
 'Voice_VN': 'voice_vn',
 'PCI_VN': 'pci_vn',
 'Building Communication Wired ': 'building_comms_wired',
 'Full Building Wired Network Services Cutover to NGN': 'full_building_wired_cutover_ngn',
 'Building Comms Wireless ': 'building_comms_wireless',
 'Wireless Cutover to NGN': 'wireless_cutover_ngn',
 'Legacy Cable & Equipment Removal': 'legacy_cable_removal',
 'Building Comms Completion': 'building_comms_completion',
 'Lessons Learned': 'lessons_learned'}

Rename the data frame columns using rename_map

In [60]:
df.rename(columns=rename_map, inplace=True)
In [61]:
df
Out[61]:
index priority team project_manager building_code building_id building_description cluster_number number_closets legacy_switch_count ... healthcare_vn voice_vn pci_vn building_comms_wired full_building_wired_cutover_ngn building_comms_wireless wireless_cutover_ngn legacy_cable_removal building_comms_completion lessons_learned
0 1.0 1.0 NaN NaN VAMC-01 4160 VA MEDICAL CTR BLDG 01 C02-02 MDF 9 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 2.0 2.0 NaN NaN VAMC-02 4161 VA MEDICAL CTR BLDG 02 C02-02 4 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 3.0 3.0 NaN NaN VAMC-03 41?? VA MEDICAL CTR BLDG 03 C02-01 1 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 4.0 4.0 NaN NaN VAMC-04 4162 VA MEDICAL CTR BLDG 04 C02-01 1 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 5.0 5.0 NaN NaN VAMC-05 4164 VA MEDICAL CTR BLDG 05 C02-01 MDF 4 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 6.0 6.0 NaN NaN VAMC-08 41?? VA MEDICAL CTR BLDG 08 C02-01 1 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
6 7.0 7.0 NaN NaN VAMC-08.5 41?? VA MEDICAL CTR BLDG 08.5 C02-01 1 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7 8.0 8.0 NaN NaN VAMC-12A 41?? VA MEDICAL CTR BLDG 12A C02-01 1 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 9.0 9.0 NaN NaN VAMC-34 4020 VA MEDICAL CTR BLDG 34 C02-01 1 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 10.0 10.0 NaN NaN VAMC-35 41?? VA MEDICAL CTR BLDG 35 C02-01 2 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10 11.0 11.0 NaN NaN VAMC-35A 4165 VA MEDICAL CTR BLDG 35A C02-01 2 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
11 12.0 12.0 NaN NaN VAMC-36 41?? VA MEDICAL CTR BLDG 36 C02-01 1 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
12 NaN NaN NaN NaN NaN NaN NaN NaN 28 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
13 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
14 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
15 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
16 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
17 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
18 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
19 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 12.0 NaN NaN NaN NaN NaN
20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN
21 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN
22 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN
23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN
24 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN
25 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN
26 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN
27 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
28 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 12.0 NaN NaN NaN NaN NaN

29 rows × 63 columns

Further Cleanup¶

  • Replace newline characters with a single space
  • Remove rows that begin with NaN
  • Remove the index column (the database will recreate it on import)
  • Add an archive_date column containing the current date to record when the data was imported
  • Set a string data type for columns containing strings and standardize their values
  • Convert the remaining (non-string) columns to float (see the sketch below)
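
A minimal sketch of these steps, assuming the renamed columns above (the particular string columns listed and the archive_date format are illustrative choices, not the original implementation):

import datetime

import pandas as pd

# Replace embedded newline characters in string cells with a single space
df = df.replace(r'\n', ' ', regex=True)

# Drop summary/empty rows whose 'index' value is NaN, then drop the column itself;
# the database will regenerate its own primary key on import
df = df.dropna(subset=['index']).drop(columns=['index'])

# Record the date this snapshot was imported
df['archive_date'] = datetime.date.today().isoformat()

# Cast an illustrative subset of string columns and standardize their values
str_cols = ['building_code', 'building_description', 'cluster_number']
df[str_cols] = df[str_cols].astype('string').apply(lambda s: s.str.strip().str.upper())

# Coerce the remaining columns to float (non-numeric entries become NaN)
num_cols = [c for c in df.columns if c not in str_cols + ['archive_date']]
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors='coerce')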

Conclusion¶

  • We've only scratched the surface of what Python can do for data analysis.

  • JupyterLab lends itself well to exploratory data analysis.

  • Online Resources:

    • Numpy (https://numpy.org)
    • Matplotlib (https://matplotlib.org)
    • HEP-specific tools: SciKit-HEP (https://scikit-hep.org)
    • The repository for this tutorial: https://github.com/WrightLaboratory/data_analysis
    • Slides can be viewed here: https://wrightlaboratory.github.io/data_analysis/#/
    • Live notebook environment here: https://mybinder.org/v2/gh/WrightLaboratory/data_analysis/HEAD
  • Further reading (O'Reilly Books available through Yale Library):

    • Python for Data Analysis by Wes McKinney
    • High Performance Python by Micha Gorelick and Ian Ozsvald