Chris Castiglione Co-founder of Console.xyz. Adjunct Prof at Columbia University Business School.

Jupyter Notebook: A Beginner’s Tutorial


Introduction

Jupyter Notebook is a web-based interactive environment for creating notebook documents. A notebook is essentially a Python program laid out step by step, which you can run and inspect one piece at a time.

The term “notebook” can refer to several things depending on the context: the Jupyter web application, the Jupyter Python web server, or the Jupyter document format.

A Jupyter Notebook document is a JSON document that follows a schema and contains an ordered list of input and output cells. Cells can hold code, Markdown text, mathematical expressions, plots, charts, and media. Jupyter Notebooks use the .ipynb file extension.
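To make the JSON structure concrete, here is a minimal sketch that builds a tiny notebook as a Python dictionary, serializes it to JSON text, and reads it back. The field names follow version 4 of the notebook format as I understand it; the cell contents and the helper name count_cell_types are made up for illustration.

```python
import json

# A hand-written, minimal notebook in the version 4 format (illustrative content).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# My first notebook"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["print(2 + 3)"],
        },
    ],
}


def count_cell_types(nb):
    """Return a dict mapping each cell type to the number of cells of that type."""
    counts = {}
    for cell in nb["cells"]:
        counts[cell["cell_type"]] = counts.get(cell["cell_type"], 0) + 1
    return counts


# Round-trip through JSON text, which is what an .ipynb file contains.
text = json.dumps(notebook, indent=1)
loaded = json.loads(text)
print(count_cell_types(loaded))
```

Running this prints {'markdown': 1, 'code': 1}. Opening a real .ipynb file in a text editor should show the same kind of structure, with more metadata.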

A great aspect of a Jupyter Notebook is that it can be converted to a number of different formats, such as HTML, slides, PDF, Markdown and even a Python script, through the “Download As” option available in the web interface.

A Jupyter Notebook is a great way to build step-by-step interactive Python programs. The technology is particularly well-suited for data analysis and plotting. 

Installation

To install Jupyter Notebook, open the command line (or the Terminal on your Mac) and run the following command:

pip install jupyter

Once the installation has finished, we can launch Jupyter Notebook by running this command:

jupyter notebook

This command starts Jupyter Notebook and opens it in a browser window.

Jupyter Notebook runs in the browser, and the main screen displays a list of the local folders on your machine where Jupyter files (with the .ipynb extension) can be saved and run.

Creating a New Notebook

From the Jupyter Notebook main screen, create a new notebook in which to start developing an interactive Python solution. You can do this by clicking on the New button and then choosing the Python 3 option.

This will open a new browser tab that displays the new notebook. Let’s go to that tab to start adding some code.

Notice the section highlighted in blue, which is known as a cell. Notebooks are made up of cells. Within cells we can write snippets of code and see the results of the execution of that code, immediately. Let’s add some basic code.

The instruction print(2+3) has been added to the cell and then Shift+Enter has been pressed. This printed 5 as the result of the operation and also added a new cell below.

We can add further code in the new cell and keep interacting with Python by pressing Shift+Enter and repeating the process, but with different snippets of code.

In essence, this is what a Jupyter Notebook is all about. It’s about being able to add snippets of code and seeing what result each snippet returns. It’s about being able to write code in an interactive way. Each snippet can go in its own cell.

So, let’s go ahead and do something really interesting, which is to analyze the content of a directory tree.

Walking and Organizing a Folder Structure

Information within a computer system is organized into folders, subfolders and files. Python is a great language for working with folders and files, and it makes it easy to loop through all the subfolders and files of a specific directory tree.

What if we could write a Python script that walks through a specific directory, retrieves the names of all the subfolders and files contained within it, and then summarizes these entries in a chart, by file extension?

This is what we’ll do. So, before we can think of plotting anything, we need to be able to walk the directory tree of a specific folder and gather the results of what subfolders and files exist within it.

Walking the directory tree of a specific folder is quite easy to do in Python. We can achieve this with the following code:

import os

# os.walk yields one (folder, subfolders, files) tuple per folder in the tree
for fn, sflds, fnames in os.walk('C:\\Temp'):
    print('Current folder is ' + fn)
    for sf in sflds:
        print(sf + ' is a subfolder of ' + fn)
    for fname in fnames:
        print(fname + ' is a file of ' + fn)
    print()

The first thing we do is import the os module, which contains functions for interacting with the operating system. Notice that 'C:\\Temp' refers to a Windows folder; you can change this to a Unix/Linux/macOS path name without any issues. Forward slashes, as in 'C:/Temp', also work on Windows.
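As a small aside on the path itself, the sketch below shows that the doubled backslash is simply Python’s way of writing a single backslash inside a normal string, and that a raw string is an equivalent spelling. The variable names are illustrative only.

```python
import os

escaped = 'C:\\Temp'   # backslash escaped inside a normal string
raw = r'C:\Temp'       # raw string: the backslash is taken literally

print(escaped == raw)  # both spell the same path
print(len(escaped))    # the path has a single backslash, not two

# A portable starting point for experimenting: the current user's home folder.
home = os.path.expanduser('~')
print(home)
```

The first two lines printed are True and 7. The last line depends on your machine.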

Then we loop through the specific folder (in this case C:\Temp) and get the folder name (fn), the names of the subfolders (sflds) contained within fn, and the names of the files (fnames) contained within fn.

The os.walk function repeats this same process for all the subfolders (sflds) contained within C:\Temp. Make sure you have a folder on your machine called C:\Temp which contains a few files and other subfolders.
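If you would rather not depend on a particular folder, the walk described above can be tried on a throwaway tree. This is a minimal sketch: the helper name walk_summary and the file names a.txt and b.csv are assumptions made for the example, and the temporary folder is deleted when the with block ends.

```python
import os
import tempfile


def walk_summary(root):
    """Return sorted (folder, subfolders, files) tuples, with folders relative to root."""
    summary = []
    for fn, sflds, fnames in os.walk(root):
        rel = os.path.relpath(fn, root)
        summary.append((rel, sorted(sflds), sorted(fnames)))
    return sorted(summary)


# Build a small throwaway tree: root/a.txt and root/sub/b.csv
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, 'sub'))
    for rel_path in ('a.txt', os.path.join('sub', 'b.csv')):
        with open(os.path.join(root, rel_path), 'w') as fh:
            fh.write('demo')
    result = walk_summary(root)

for folder, subfolders, files in result:
    print(folder, subfolders, files)
```

The output has one line for the root folder (shown as a dot) listing the subfolder sub and the file a.txt, and one line for sub listing b.csv. This illustrates that os.walk visits the subfolders for us.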

So let’s add this code to the Jupyter Notebook cell and then press Shift+Enter to check what results we get. These results will vary, as they depend on the files and subfolders that you have under C:\Temp.

Now that we are able to walk through the directory tree of the C:\Temp folder, let’s modify the code so that it records the folders and files as categorized results that we can plot later. The modified code looks as follows.

import os

items = []
for fn, sflds, fnames in os.walk('C:\\Temp'):
    for sf in sflds:
        item = {}  # a new dictionary for each entry
        item['location'] = fn
        item['name'] = sf
        item['type'] = 'folder'
        items.append(item)
    for fname in fnames:
        item = {}
        item['location'] = fn
        item['name'] = fname
        item['type'] = 'file'
        items.append(item)

for itm in items:
    print(itm)

In essence, the main code structure remains the same. Notice, however, that we’ve removed the print statements from the loop and instead declared an items list and an item dictionary, which holds the information for one individual file or subfolder within the C:\Temp directory tree.

For each individual file or subfolder contained within the C:\Temp directory tree, we need to know its parent location (the folder or subfolder that contains it), its name, and the type of resource it is: a file or a folder. This is why each item is a dictionary.

As the code walks the directory tree and loops across the subfolders and files contained within, they are added to the items list, which is what items.append does.
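One detail is worth checking whenever dictionaries are appended to a list inside a loop: the list stores a reference to the dictionary, not a copy. The following sketch, with made-up file names, contrasts reusing one dictionary with creating a new one per entry.

```python
names = ['a.txt', 'b.csv']

# Reusing one dictionary: every list entry points at the same object.
shared = []
item = {}
for name in names:
    item['name'] = name
    shared.append(item)

# Creating a new dictionary on each pass keeps the entries independent.
separate = []
for name in names:
    separate.append({'name': name})

print(shared)    # both entries show the last name written
print(separate)  # one entry per name
```

The first list prints [{'name': 'b.csv'}, {'name': 'b.csv'}], while the second prints [{'name': 'a.txt'}, {'name': 'b.csv'}]. So each file or subfolder needs a dictionary of its own if the list is to keep them apart.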

Finally, we can print the content of the items list by iterating through each itm within items. If we now press Shift+Enter to run the code, we should see the results organized differently, with one dictionary printed per file or subfolder.

Awesome, we are now able to walk through a directory tree for a specific folder and categorize each item accordingly. Now, we can look at how to plot this using the amazing Plotly library.

Installing Plotly

The first thing we need to do if we want to use the Plotly library is to install it. We can do this by running pip install plotly from the command line. Detailed instructions on how to do this can be found on the official Plotly documentation website.

Determining File Extensions

Once Plotly has been installed, we can start to use it within our code. But before we do that, wouldn’t it be great if we could know what percentage of the files in the directory tree belongs to each file type (by file extension)?

Say we want to know how many .csv or .pptx files exist within the directory tree. To do that, we need to slightly modify the code we have and add the file extension as a dictionary property. Here’s what the modified code looks like.

import os

items = []
for fn, sflds, fnames in os.walk('C:\\Temp'):
    for sf in sflds:
        item = {}  # a new dictionary for each entry
        item['location'] = fn
        item['name'] = sf
        item['type'] = 'folder'
        item['ext'] = 'N/A'
        items.append(item)
    for fname in fnames:
        item = {}
        item['location'] = fn
        item['name'] = fname
        item['type'] = 'file'
        item['ext'] = os.path.splitext(fname)[1]
        items.append(item)

for itm in items:
    print(itm)

The code is essentially the same. There are only two differences, which are the two lines that set item['ext']. The instruction item['ext'] = 'N/A' indicates that a subfolder has no file extension, whereas the instruction item['ext'] = os.path.splitext(fname)[1] retrieves the file extension for each file that is found when walking the directory tree.
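To see what os.path.splitext returns for a few typical names, here is a short sketch. The sample file names are invented for the example.

```python
import os

samples = ['report.csv', 'archive.tar.gz', 'README', '.gitignore']

# splitext returns a (root, extension) pair; index 1 is the extension.
extensions = [os.path.splitext(fname)[1] for fname in samples]

for fname, ext in zip(samples, extensions):
    print(repr(fname), '->', repr(ext))
```

This gives '.csv' for report.csv and '.gz' for archive.tar.gz (only the last suffix counts). README and .gitignore both give an empty string, because a name without a dot, or with only a leading dot, has no extension.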

So, if we now run this code by pressing Shift+Enter, we should be able to see the list of files within the directory tree, including the file extension.

Awesome, the results now include file extensions, which is what we’ll be using as a main metric for our plot.

Aggregating File Extensions

Now that we have the files and their extensions, we need to count how many instances of each file extension exist, as this will help us plot our results.

So, let’s modify our code so we can not only return the list of files (items), but also the list of file extensions (which we will call exts), in order to aggregate the results. We’ll also convert our code to a function (by using the keyword def), to ensure reusability and modularity going forward.

import os

def getfiles(dir):
    items = []
    exts = []
    for fn, sflds, fnames in os.walk(dir):
        for sf in sflds:
            item = {}  # a new dictionary for each entry
            item['location'] = fn
            item['name'] = sf
            item['type'] = 'folder'
            item['ext'] = 'N/A'
            items.append(item)
        for fname in fnames:
            item = {}
            item['location'] = fn
            item['name'] = fname
            item['type'] = 'file'
            item['ext'] = os.path.splitext(fname)[1]
            # Only record extensions for files that actually have one
            if item['ext'].strip() != '':
                exts.append(item['ext'])
            items.append(item)
    return (items, exts)

Basically, what we have added is a new list called exts, which will contain all the file extensions found within the directory tree.

Each extension found (item['ext']) is added to the exts list with the instruction exts.append(item['ext']), but only when a file actually has an extension, which is checked by the condition if item['ext'].strip() != ''.

Grouping and Counting File Extensions

Now that we have the file extensions collected in their own list, we can group identical extensions together and count how many instances of each we have. We can do this as follows.

def getplotdata(exts):
    xts = dict((ext, exts.count(ext)) for ext in exts)
    labels = list(xts.keys())
    values = list(xts.values())
    return labels, values

The instruction xts = dict((ext, exts.count(ext)) for ext in exts) is what does the magic. It creates a dictionary of the extensions found in the directory tree, grouped by extension type and each with its own count.

In other words, dict((ext, exts.count(ext)) for ext in exts) builds a dictionary whose keys are the distinct file extensions (without duplicates) and whose values are the number of instances found for each one.

The pie chart we are going to plot takes its labels and its values as two separate lists, so we cannot pass it the dictionary directly. We’ll need the labels (the list of xts.keys) and the values (the list of xts.values) as independent lists.

This is why we create a list of keys from the file extension dictionary with the following instruction: labels = list(xts.keys())

This is also why we create a list of values from the file extension dictionary with the following instruction: values = list(xts.values())

The list of xts.keys (labels) is nothing more than the list of file extension names, e.g. “.pdf”, “.pptx”, “.py”, whereas the list of xts.values (values) is the number of files found for each extension within the directory tree, e.g. 5, 4, 3.

In this example, there would be 5 files with the .pdf extension, 4 files with the .pptx extension and 3 files with the .py extension, contained in that directory tree. 
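The same grouping and counting can also be sketched with collections.Counter from the standard library, which goes through the list once instead of calling count for every element. This is an alternative to the function above, not a replacement for it. The name getplotdata_counter and the sample list are made up for the example.

```python
from collections import Counter


def getplotdata_counter(exts):
    """Hypothetical variant of getplotdata built on collections.Counter."""
    counts = Counter(exts)          # one pass over the list
    labels = list(counts.keys())    # distinct extensions, in first-seen order
    values = list(counts.values())  # matching counts
    return labels, values


sample = ['.pdf', '.py', '.pdf', '.pptx', '.pdf', '.py']
labels, values = getplotdata_counter(sample)
print(labels)
print(values)
```

For the sample list this prints ['.pdf', '.py', '.pptx'] and [3, 2, 1]. For a folder with a few hundred files the difference in speed is negligible, so either version is fine for this tutorial.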

Plotting the Results

Now that we have grouped and counted the file extensions found within the directory tree, we are finally ready to plot the results. The first thing we need to do is to import the Plotly library, which we can do with the following statements.

import plotly.graph_objs as go
import plotly as ply

Once that has been done, we can define a function that will be used for creating the plot. Let’s have a look.

def plotres(labels, values):
    trace = go.Pie(labels=labels, values=values)
    ply.offline.plot([trace])

The function is very simple. It receives the labels and values returned by the getplotdata function, creates a pie chart from them, and then invokes the plot function to render the results locally.

Below is the complete finished code.

import os
import plotly.graph_objs as go
import plotly as ply

def getfiles(dir):
    items = []
    exts = []
    for fn, sflds, fnames in os.walk(dir):
        for sf in sflds:
            item = {}  # a new dictionary for each entry
            item['location'] = fn
            item['name'] = sf
            item['type'] = 'folder'
            item['ext'] = 'N/A'
            items.append(item)
        for fname in fnames:
            item = {}
            item['location'] = fn
            item['name'] = fname
            item['type'] = 'file'
            item['ext'] = os.path.splitext(fname)[1]
            # Only record extensions for files that actually have one
            if item['ext'].strip() != '':
                exts.append(item['ext'])
            items.append(item)
    return (items, exts)

def getplotdata(exts):
    xts = dict((ext, exts.count(ext)) for ext in exts)
    labels = list(xts.keys())
    values = list(xts.values())
    return labels, values

def plotres(labels, values):
    trace = go.Pie(labels=labels, values=values)
    ply.offline.plot([trace])

items, exts = getfiles('C:\\Temp')
labels, values = getplotdata(exts)
plotres(labels, values)

If we now run the code in the Jupyter Notebook cell by pressing Shift+Enter, we’ll be able to see the results plotted in a pie chart, with one slice per file extension.

Isn’t that awesome! With a few lines of code we were able to write a short script that walks a directory tree and gives us an overview of what types of file extensions exist within the folder structure and how many instances of each there are, expressed as percentages.
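For readers who want the numbers behind the slices, the shares can also be worked out by hand. The sketch below is a minimal example that reuses the illustrative counts from earlier (5, 4 and 3). The function name percentages is an assumption of this example and is not part of Plotly.

```python
def percentages(values):
    """Return each value's share of the total, in percent, rounded to one decimal."""
    total = sum(values)
    if total == 0:
        # No files with an extension: avoid dividing by zero.
        return [0.0 for _ in values]
    return [round(100 * v / total, 1) for v in values]


labels = ['.pdf', '.pptx', '.py']
values = [5, 4, 3]
for label, share in zip(labels, percentages(values)):
    print(label, share)
```

With 12 files in total, this prints 41.7 for .pdf, 33.3 for .pptx and 25.0 for .py.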

All this was done with an interactive and easy-to-use environment, which is Jupyter Notebook. 
