Introduction:
Data analysis is a crucial step in making informed decisions based on data. It involves cleaning, transforming, and exploring raw data so that useful information can be extracted from it. Python has emerged as one of the most popular languages for data analysis due to its powerful libraries and easy-to-learn syntax. In this article, we will focus on three of the most popular Python libraries for data analysis: NumPy, Pandas, and Matplotlib.
NumPy:
NumPy (Numerical Python) is an open-source library used in practically every discipline of science and engineering. It is the de facto standard in Python for handling numerical data, and its users range from novice programmers to experienced researchers working on cutting-edge scientific and industrial research and development. Most other Python data science and scientific packages, including Pandas, SciPy, Matplotlib, scikit-learn, and scikit-image, make significant use of the NumPy API. Its core object is the ndarray (n-dimensional array), a homogeneous container for numerical data. Because every element of an ndarray has the same type, NumPy can store the data compactly and run computations on large datasets faster and with less memory than equivalent code using ordinary Python lists.
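To make the homogeneity point concrete, here is a minimal sketch (the variable names are illustrative and not part of the article's examples). It shows that an array has a single data type, that mixed inputs are converted to a common type, and that arithmetic applies to every element without an explicit loop.

```python
import numpy as np

# Every element of an ndarray shares one data type (dtype).
ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])   # the integers are upcast to float

print(mixed.dtype)              # float64
print(mixed)                    # [1.  2.5 3. ]

# Arithmetic is applied element-wise, with no explicit Python loop.
doubled = ints * 2
print(doubled)                  # [2 4 6]

# Memory use is the element count times the bytes per element.
print(ints.nbytes == ints.size * ints.itemsize)  # True
```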
Creating NumPy Arrays
Here’s an example of creating a simple 1-dimensional NumPy array:
import numpy as np
a = np.array([1, 2, 3, 4, 5])
print(a)
Output:
[1 2 3 4 5]
In this example, we’ve created a 1-dimensional NumPy array of integers using the np.array() function. We passed in a list of integers, and NumPy created an array with the same number of elements.
We can also create multi-dimensional arrays using the np.array() function. Here’s an example of creating a 2-dimensional NumPy array:
b = np.array([[1, 2, 3], [4, 5, 6]])
print(b)
Output:
[[1 2 3]
 [4 5 6]]
In this example, we’ve created a 2-dimensional NumPy array of integers using the np.array() function. We passed in a list of lists, and NumPy created a 2-dimensional array with 2 rows and 3 columns.
NumPy also provides many functions for creating arrays with specific properties. For example, we can create an array of zeros or ones using the np.zeros() and np.ones() functions, respectively:
c = np.zeros((3, 4))
print(c)
Output:
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
In this example, we’ve created a 2-dimensional array of zeros with 3 rows and 4 columns using the np.zeros() function.
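The text mentions np.ones() without showing it. The sketch below covers it together with a few other standard NumPy constructors (np.full(), np.arange(), and np.linspace()). The shapes and values are arbitrary choices for illustration.

```python
import numpy as np

ones = np.ones((2, 3))            # 2x3 array filled with 1.0
sevens = np.full((2, 2), 7)       # 2x2 array filled with the value 7
steps = np.arange(0, 10, 2)       # start, stop (exclusive), step
grid = np.linspace(0.0, 1.0, 5)   # 5 evenly spaced values, both ends included

print(ones)
print(sevens)
print(steps)   # [0 2 4 6 8]
print(grid)    # [0.   0.25 0.5  0.75 1.  ]
```

Note that np.zeros() and np.ones() produce floating-point arrays by default, which is why the zeros above are printed as "0." rather than "0".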
Indexing NumPy Arrays
We can access individual elements of a NumPy array using indexing. Here’s an example of accessing the second element of the a array:
print(a[1])
Output:
2
In this example, we’ve accessed the second element of the ‘a’ array using indexing. Note that indexing in NumPy arrays starts from 0, so the second element is at index 1.
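Indexing also works from the end of an array and across several dimensions. The following sketch recreates the a and b arrays from the earlier examples so that it can run on its own.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([[1, 2, 3], [4, 5, 6]])

print(a[-1])     # 5, negative indices count from the end
print(b[1, 2])   # 6, the element in row 1, column 2
print(b[0])      # [1 2 3], a single index on a 2-D array selects a row
```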
Slicing NumPy Arrays
We can also use slicing to access a subset of the elements of a NumPy array. Here’s an example of accessing the first three elements of the a array:
print(a[:3])
Output:
[1 2 3]
In this example, we’ve used slicing to access the first three elements of the a array. A slice is written as start:stop, where the start index is inclusive and the stop index is exclusive. Omitting the start, as in a[:3], begins the slice at index 0.
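A few more slicing patterns are sketched below, again with a and b recreated so the snippet is self-contained. One detail worth knowing is that a basic slice is a view onto the original data, not a copy.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([[1, 2, 3], [4, 5, 6]])

print(a[1:4])    # [2 3 4], indices 1, 2 and 3
print(a[::2])    # [1 3 5], every second element
print(b[:, 1])   # [2 5], the second column of every row

# A basic slice is a view: writing through it changes the original array.
view = a[:2]
view[0] = 99
print(a)         # [99  2  3  4  5]

# Use .copy() when an independent array is needed.
independent = a[:2].copy()
independent[0] = 0
print(a[0])      # 99, the original is unaffected this time
```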
Reshaping NumPy Arrays
We can reshape NumPy arrays using the np.reshape() function. The new shape must hold the same number of elements as the original. Here’s an example of reshaping the a array into a 2-dimensional array with 5 rows and 1 column:
a_reshaped = np.reshape(a, (5, 1))
print(a_reshaped)
Output:
[[1]
 [2]
 [3]
 [4]
 [5]]
In this example, we’ve used the np.reshape() function to reshape the a array into a 2-dimensional array with 5 rows and 1 column.
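Reshaping is also available as a method on the array itself, and one dimension can be given as -1 so that NumPy infers it from the element count. The sketch below uses a six-element array chosen only for illustration, and it also shows what happens when the requested shape does not fit.

```python
import numpy as np

a = np.arange(1, 7)             # [1 2 3 4 5 6]

two_rows = a.reshape(2, -1)     # -1 lets NumPy work out the column count
print(two_rows.shape)           # (2, 3)

flat = two_rows.reshape(-1)     # back to one dimension
print(flat)                     # [1 2 3 4 5 6]

# The total number of elements must stay the same.
try:
    a.reshape(4, 2)             # 8 slots for 6 elements
except ValueError as exc:
    print("reshape failed:", exc)
```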
Pandas:
Pandas is the second of the three libraries covered in this article. One of the best-known Python libraries for data analysis, it was created by Wes McKinney in 2008 in response to a need for a powerful and flexible tool for quantitative analysis. It has an active contributor community.
Pandas is built on top of NumPy, which provides its underlying array storage and numerical operations, and it integrates with Matplotlib for data visualisation. As a result, many common NumPy and Matplotlib tasks can be done in fewer lines of code through Pandas. For instance, the .plot() method in Pandas wraps several Matplotlib calls, allowing you to draw a chart from a Series or DataFrame with a single call.
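As a rough illustration of how the Pandas .plot() method hands the drawing over to Matplotlib, consider the sketch below. The sales data is made up for this example, and the "Agg" backend is selected only as an assumption so the snippet can run on a machine without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for headless runs
import pandas as pd

# Hypothetical data, used only for this illustration.
sales = pd.DataFrame({"month": [1, 2, 3, 4], "units": [10, 14, 9, 17]})

# One call creates the figure, draws the line and sets the title.
ax = sales.plot(x="month", y="units", kind="line", title="Units per month")

# The return value is a regular Matplotlib Axes, so it can be customised further.
ax.set_ylabel("units")
print(type(ax).__module__)
```

In an interactive session, calling plt.show() from matplotlib.pyplot afterwards would display the chart.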
Pandas is a prominent open-source data manipulation library that is widely used in data analysis and data science projects. It offers user-friendly data structures and data analysis tools that let users manage and work with large data sets effectively. The following sections go over some of the most important Pandas features, with programming examples.
Data Structures
The two main data structures offered by Pandas are Series and DataFrame.
A Series is a one-dimensional, array-like object with an index that can hold data of any type. It is similar to a single column in a spreadsheet or a SQL table. A Series can be built by passing a list or array of values to the Series constructor, as demonstrated below:
import pandas as pd
# Creating a Series from a list
my_list = [10, 20, 30, 40, 50]
my_series = pd.Series(my_list)
print(my_series)
Output:
0    10
1    20
2    30
3    40
4    50
dtype: int64
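The left-hand column in the output above is the index of the Series, which defaults to 0, 1, 2, and so on. As a small sketch, the index can also be given explicitly. The fruit prices here are invented purely for illustration.

```python
import pandas as pd

# Hypothetical data with an explicit, label-based index.
prices = pd.Series([1.5, 0.25, 3.0], index=["apple", "banana", "cherry"])

print(prices["banana"])      # 0.25, lookup by label
print(prices.iloc[0])        # 1.5, lookup by position

# Arithmetic and comparisons apply to every element, as with NumPy arrays.
doubled = prices * 2
expensive = prices[prices > 1]
print(doubled.tolist())           # [3.0, 0.5, 6.0]
print(expensive.index.tolist())   # ['apple', 'cherry']
```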
A DataFrame is a two-dimensional tabular data structure with labelled axes (rows and columns). It resembles a spreadsheet or a SQL table. DataFrames can be created from many sources, including CSV files, Excel files, SQL databases, and Python dictionaries. The example below creates a DataFrame from a dictionary:
# Creating a DataFrame from a dictionary
my_dict = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
           'age': [25, 32, 18, 47, 29],
           'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami']}
my_df = pd.DataFrame(my_dict)
print(my_df)
Output:
| name | age | city |
0 | Alice | 25 | New York |
1 | Bob | 32 | Los Angeles |
2 | Charlie | 18 | Chicago |
3 | David | 47 | Houston |
4 | Emily | 29 | Miami |
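Once a DataFrame exists, a few standard attributes and accessors help to inspect it. The sketch below rebuilds the same my_df so that it runs on its own.

```python
import pandas as pd

my_df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Emily"],
    "age": [25, 32, 18, 47, 29],
    "city": ["New York", "Los Angeles", "Chicago", "Houston", "Miami"],
})

print(my_df.shape)             # (5, 3), rows and columns
print(my_df["age"].tolist())   # [25, 32, 18, 47, 29], one column as a list
print(my_df.loc[1, "name"])    # Bob, row label 1, column 'name'
print(my_df["age"].max())      # 47
print(my_df.head(2))           # the first two rows
```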
Data Manipulation
Pandas offers powerful data manipulation tools, including filtering, grouping, and aggregation.
Filtering
Boolean indexing can be used to filter DataFrames. The following example filters a DataFrame to select only the rows where the age is greater than 30:
# Filtering a DataFrame
my_filtered_df = my_df[my_df['age'] > 30]
print(my_filtered_df)
Output:
| name | age | city |
1 | Bob | 32 | Los Angeles |
3 | David | 47 | Houston |
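Several conditions can be combined in one filter. As a minimal sketch on the same data, note that Pandas uses & (and) and | (or) rather than the Python keywords, and that each condition needs its own parentheses.

```python
import pandas as pd

my_df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Emily"],
    "age": [25, 32, 18, 47, 29],
    "city": ["New York", "Los Angeles", "Chicago", "Houston", "Miami"],
})

# Rows where the age is strictly between 20 and 40.
mask = (my_df["age"] > 20) & (my_df["age"] < 40)
in_range = my_df[mask]
print(in_range["name"].tolist())   # ['Alice', 'Bob', 'Emily']

# Rows whose city is one of a given set of values.
selected = my_df[my_df["city"].isin(["New York", "Miami"])]
print(selected["name"].tolist())   # ['Alice', 'Emily']
```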
Grouping
Using the groupby() method, DataFrames can be grouped by one or more columns. The example below demonstrates how to group a DataFrame by city and determine the average age for each city:
# Grouping a DataFrame (numeric_only=True skips the text column 'name')
my_grouped_df = my_df.groupby('city').mean(numeric_only=True)
print(my_grouped_df)
Output:
city | age |
Chicago | 18.0 |
Houston | 47.0 |
Los Angeles | 32.0 |
Miami | 29.0 |
New York | 25.0 |
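Because every city in the example contains exactly one person, each group mean simply equals that person's age. The sketch below uses a small made-up dataset in which cities repeat, so the effect of grouping is easier to see.

```python
import pandas as pd

# Hypothetical data in which each city appears twice.
people = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "age": [25, 35, 18, 22],
    "city": ["New York", "New York", "Chicago", "Chicago"],
})

# Select the numeric column first, then average within each city.
mean_age = people.groupby("city")["age"].mean()
print(mean_age["Chicago"])    # 20.0
print(mean_age["New York"])   # 30.0

# size() counts the rows in each group.
group_sizes = people.groupby("city").size()
print(group_sizes.tolist())   # [2, 2]
```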
Aggregation
The agg() method is used to aggregate DataFrames. The example below computes the minimum, maximum, and average age for each city:
# Aggregating a DataFrame
my_agg_df = my_df.groupby('city').agg({'age': ['min', 'max', 'mean']})
print(my_agg_df)
Output:
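city | age (min) | age (max) | age (mean) |
Chicago | 18 | 18 | 18.0 |
Houston | 47 | 47 | 47.0 |
Los Angeles | 32 | 32 | 32.0 |
Miami | 29 | 29 | 29.0 |
New York | 25 | 25 | 25.0 |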
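Passing a dictionary to agg() produces two-level column labels. An alternative is "named aggregation", which gives each result column a flat name of your choosing. The sketch below uses made-up data, and the column names (youngest, oldest, average) are arbitrary.

```python
import pandas as pd

# Hypothetical data in which each city appears twice.
people = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "age": [25, 35, 18, 22],
    "city": ["New York", "New York", "Chicago", "Chicago"],
})

# Each keyword is an output column: (input column, aggregation function).
summary = people.groupby("city").agg(
    youngest=("age", "min"),
    oldest=("age", "max"),
    average=("age", "mean"),
)
print(summary)
```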
Matplotlib:
Matplotlib is a well-known Python library for data visualization and is widely used in data science and research. With it you can generate a wide range of plots and charts, such as line charts, scatter plots, histograms, bar charts, and more.
Matplotlib must be installed before you can use it, for example with ‘pip’. Once it is installed, the following statement imports it into your Python code:
import matplotlib.pyplot as plt
This imports the Matplotlib ‘pyplot’ module and gives it the alias ‘plt’, which is frequently used in Matplotlib code.
Let’s begin with a simple Matplotlib line chart. Suppose we have two lists, ‘x’ and ‘y’, which hold the x and y values for the line chart. Here is how the chart can be created:
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
plt.plot(x, y)
plt.show()
This code first defines the lists ‘x’ and ‘y’, then uses the ‘plot()’ function to generate a line chart of ‘y’ against ‘x’. The ‘show()’ function displays the chart, typically in a new window when the script is run from a terminal.
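Charts are usually easier to read with axis labels, a title, and a legend. The following sketch adds them using Matplotlib's object-oriented interface (plt.subplots()). The label text is arbitrary, and the "Agg" backend is selected only as an assumption so that the snippet can run without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for headless runs
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

fig, ax = plt.subplots()                    # one figure containing one set of axes
ax.plot(x, y, marker="o", label="y = 2x")   # line with a marker at each point
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A labelled line chart")
ax.legend()                                 # shows the label passed to plot()
```

In an interactive session, plt.show() would display the figure, and fig.savefig() with a file name of your choice would write it to disk.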
Let’s now look at a scatter plot created with Matplotlib. As before, the lists ‘x’ and ‘y’ hold the x and y values for the plot, which can be created as follows:
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
plt.scatter(x, y)
plt.show()
This code generates a scatter plot of ‘y’ against ‘x’ using the ‘scatter()’ function. The resulting plot displays a number of points, each of which stands for a pair of ‘x’ and ‘y’ values.
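A scatter plot can encode more than two variables by varying the size and colour of each point. In the sketch below, the sizes list is invented for illustration and the colour simply follows the y value. The "Agg" backend is again an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for headless runs
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
sizes = [20, 40, 60, 80, 100]   # hypothetical marker areas

fig, ax = plt.subplots()
# s sets the marker size, c the values mapped to colours, cmap the colour map.
points = ax.scatter(x, y, s=sizes, c=y, cmap="viridis")
fig.colorbar(points, ax=ax, label="y value")   # legend for the colour scale
ax.set_xlabel("x")
ax.set_ylabel("y")
```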
Let’s now look at a histogram made with Matplotlib. Assume we have a dataset stored in a list, x. The histogram can be created as shown below:
import matplotlib.pyplot as plt
x = [1, 2, 2, 3, 3, 3, 4, 4, 5]
plt.hist(x)
plt.show()
This code creates a histogram of the data in x using the hist() function. The resulting chart shows how many values fall into each bin. By default, hist() divides the range of the data into 10 equal-width bins.
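The number of bins can be set with the bins parameter, and hist() also returns the counts and bin edges it computed. The sketch below uses five bins on the same data so that each distinct value falls into its own bin. The "Agg" backend is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for headless runs
import matplotlib.pyplot as plt

x = [1, 2, 2, 3, 3, 3, 4, 4, 5]

fig, ax = plt.subplots()
# hist() returns the per-bin counts, the bin edges and the drawn bars.
counts, edges, bars = ax.hist(x, bins=5)
ax.set_xlabel("value")
ax.set_ylabel("frequency")

print(counts)       # [1. 2. 3. 2. 1.]
print(len(edges))   # 6, one more edge than there are bins
```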
Conclusion:
NumPy, Pandas, and Matplotlib are three powerful Python libraries for data analysis. Together they provide efficient array manipulation and numerical computing, flexible data manipulation and analysis, and high-quality visualizations for communicating results.