Dr. Priynaga D. Talagala
IASSL Workshop on Data visualization with R and
25-5-2023
To install Python libraries, we use the pip command in the operating system's command-line console. In Jupyter, console commands can be executed from within a cell by placing a '!' before the command. It is recommended to use Python's sys module: sys.executable returns the path of the Python interpreter on which Jupyter is running, so the pip that belongs to that interpreter is the one used.
Syntax:
import sys
!{sys.executable} -m pip install [package_name]
With the above code, the package is installed into the same Python installation on which the Jupyter notebook is running.
# This is a comment
# import sys
# !{sys.executable} -m pip install pandas
# !{sys.executable} -m pip install palmerpenguins
# !{sys.executable} -m pip install matplotlib
# !{sys.executable} -m pip install seaborn
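Outside a notebook, the same idea can be sketched in plain Python. The helper name build_pip_command below is hypothetical and introduced only for illustration; it builds the command that the `!{sys.executable} -m pip install ...` line would run, without installing anything.

```python
import sys


def build_pip_command(package_name):
    """Return the pip install command for the interpreter running this code.

    Hypothetical helper for illustration; it only builds the command list.
    """
    return [sys.executable, "-m", "pip", "install", package_name]


# sys.executable is the full path of the running Python interpreter
print(sys.executable)

command = build_pip_command("pandas")
print(" ".join(command))
```

To actually perform the installation from a script, the list could be passed to subprocess.run(command, check=True).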
Once a library is installed, import it into your application by adding an import statement:
import [package_name]
Typing package_name.foo in your code can be tedious. Tedium can be minimized by using
import [package_name] as [pkg]
then typing pkg.foo.
Or
from [package_name] import [foo]
Here, to use another item from the module, you have to update your import statement.
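As a small illustration of the three import styles, the sketch below uses the math module from Python's standard library as a stand-in for [package_name]; the choice of math and sqrt is ours, made only so that the example runs without installing anything.

```python
# Style 1: import the module and use its full name
import math
print(math.sqrt(16))   # 4.0

# Style 2: import the module under a shorter alias
import math as m
print(m.sqrt(16))      # 4.0

# Style 3: import only the name you need
from math import sqrt
print(sqrt(16))        # 4.0

# With style 3, other names from the module are not available
# until the import statement is updated, for example:
from math import sqrt, pi
print(pi)
```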
Let's start by importing Pandas, which is a great library for managing relational (i.e. table-format) datasets:
# Pandas for managing datasets
# By convention, it is imported with the shorthand pd.
import pandas as pd
Next, we'll import Matplotlib, which will help us customize our plots further. Seaborn, which we import afterwards, provides simple code for producing complex statistical plots that are also aesthetically pleasing. Because Seaborn is built on top of Matplotlib, its plots can be customized further with Matplotlib functions.
# Matplotlib for additional customization
# Matplotlib is the whole package; pyplot is a module in matplotlib.
# By convention, it is imported with the shorthand plt.
from matplotlib import pyplot as plt
# Alternative way
# import matplotlib.pyplot as plt
# It is generally customary to use `import matplotlib.pyplot as plt`
# and suggested in the matplotlib documentation.
Then, we'll import the Seaborn library. Seaborn is a data visualization library built on top of matplotlib and closely integrated with pandas data structures in Python.
# Seaborn for plotting and styling
# By convention, it is imported with the shorthand sns.
import seaborn as sns
Tip: we gave each of our imported libraries an alias. Later, we can invoke Pandas with pd, Matplotlib with plt, and Seaborn with sns.
Now we're ready to import our dataset. You can import your CSV file using Pandas.
# Import dataset
df = pd.read_csv("data/titanic.csv")
Here's what the dataset looks like:
# Display first 5 observations
df.head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
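If data/titanic.csv is not at hand, the behaviour of pd.read_csv can be tried on a small in-memory sample. The sketch below is only a stand-in for the real file: it copies three rows and a subset of the columns from the table above and wraps the text in io.StringIO so that pandas can read it as if it were a file. The variable names csv_text and sample are ours.

```python
import io

import pandas as pd

# Three rows copied from the table above (subset of columns)
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Fare,Cabin
1,0,3,male,22.0,7.25,
2,1,1,female,38.0,71.2833,C85
3,1,3,female,26.0,7.925,
"""

# read_csv accepts a file path or a file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head())
print(sample.shape)                   # (rows, columns)
print(sample["Cabin"].isna().sum())   # empty fields are read as NaN
```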
# summary statistics
df.Age.describe()
count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64
# Tip :
# to get the help page
# help(df.describe)
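To see how describe() behaves, it can help to run it on a handful of values that are easy to check by hand. The sketch below uses only the five Age values visible in the first rows of the dataset, so its numbers differ from the full-dataset output shown above; the names ages and summary are ours.

```python
import pandas as pd

# Age values of the first five passengers shown earlier
ages = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0, 35.0]})

# Attribute access and bracket access select the same column
summary = ages.Age.describe()
same_summary = ages["Age"].describe()

print(summary)

# The result is a Series, so single statistics can be looked up by label
print(summary["count"])   # number of non-missing values
print(summary["mean"])    # arithmetic mean
print(summary["50%"])     # the median
```

Looking up a statistic by its label is often handier than reading it off the printed summary when the value is needed in a later calculation.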
In Python, the dot (.) is primarily used as a separator to access attributes and methods of objects.
This syntax is known as dot notation or dot operator.
When you have an object, such as a variable or an instance of a class, you can use the dot notation to access its attributes and methods.
For example, if you have a DataFrame named my_df, you can use dot notation to access various methods and attributes associated with DataFrames.
For instance, you can use
- my_df.head() to retrieve the first few rows of the DataFrame
- my_df.shape to obtain the dimensions of the DataFrame
- my_df.columns to access the column names

The dot operator allows you to specify the DataFrame on which you want to perform actions or retrieve values, enabling you to manipulate and analyze the data effectively.
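The difference between a method and an attribute can be seen in a minimal sketch. The small DataFrame here is built by hand from the first three rows shown earlier, purely for illustration.

```python
import pandas as pd

# A small DataFrame; values taken from the first three rows shown earlier
my_df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Age": [22.0, 38.0, 26.0],
})

# Method: called with parentheses, and may take arguments
print(my_df.head(2))

# Attributes: accessed without parentheses
print(my_df.shape)           # (3, 3)
print(list(my_df.columns))   # ['PassengerId', 'Survived', 'Age']
```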
Working with packages
The dot is also used to access a function provided by a package or module, for example math.sqrt().
Note: By assigning the result of pd.read_csv("data/titanic.csv") to the variable df, you create a DataFrame object that holds the data from the CSV file. The variable df then becomes the gateway to accessing various pandas capabilities for data analysis and manipulation.