Exploratory Data Analysis: Using R and Python
Early in my data science journey, I was inspired by a post which created a roadmap of things I could learn to become proficient. Below I will attempt to follow a similar exploratory path using R and Python.
Before diving into complex models, the first step in our plant data science odyssey is exploratory data analysis (EDA), a crucial process that allows us to understand the basic structure of our plant data, identify anomalies, and test hypotheses at a glance. Here, we'll use R and Python to scrutinize our Raspberry Pi sensor data, setting the stage for deeper analysis.
Before we begin: account connection setup
The MariaDB database on the Raspberry Pi will need a user setup with the IP address of the PC from which you are doing the analysis. With your IP address in available to you, use terminal to connect to the Raspberry Pi (as we have previously) then connect to the database (as we have previously). Once connected to the database, enter the following command:
CREATE USER 'exampleuser'@'192.168.1.9999' IDENTIFIED BY 'password';
Note, you will need to replace the example IP address listed of 192.168.1.9999 to your PC IP address. The above creates the user ‘exampleuser’ at your IP address with the password of ‘password’. You should be more thoughtful and think of something in line with your project. Enter the following command to give the user privaleges to access and update the database:
GRANT ALL PRIVILEGES ON SensorDataDB.* TO 'exampleuser'@'192.168.1.9999'; FLUSH PRIVILEGES;
You can check to confirm that the correct username and IP address have been added to the Host table by entering the following command:
SELECT User, Host FROM mysql.user
If your username and IP address correctly appear, you can now remotely query data from the Raspberry Pi to your PC.
retrieving sensor data with R and python
R and Python follow similar patterns to connect and retrieve data from the Raspberry Pi database. With your credentials, a query, and a variable we can pull and store all the data we need to complete our analysis.
retrieving sensor data with python
For our plant monitoring project, timely and efficient data retrieval is just as crucial as the subsequent analysis. In this section, we'll cover how to extract the sensor data stored in our MariaDB database using Python. We'll establish a database connection and execute a SQL query to fetch the data we need using libraries Python provides for interacting with databases. For MariaDB, we can use the mysql.connector
library, which allows us to connect to the database and execute SQL queries seamlessly. Below is a brief example:
import mysql.connector import pandas as pd # Database connection details db_host = '192.168.1.188' # Raspberry Pi's IP address db_user = 'Orion' db_password = 'GardenOfEden' db_name = 'SensorDataDB' # Establish a database connection conn = mysql.connector.connect( host=db_host, user=db_user, passwd=db_password, database=db_name ) # SQL query to execute query = """ SELECT sm.* FROM sensorMeasurements sm INNER JOIN sensorData sd ON sm.sensorOI = sd.sensorOI LEFT JOIN plantData p ON sm.sensorOI = p.sensorOI ORDER BY timestamp DESC, sensorOI; """ # Execute the query and fetch the results into a DataFrame df = pd.read_sql_query(query, conn) # Close the database connection conn.close() # Display the first few rows of the DataFrame print(df.head())
retrieving sensor data with R
Similar to Python, we'll establish a database connection and execute a SQL query to fetch the data we need using several libraries R provides for interacting with databases. For MariaDB, we can use the RMariaDB
library, which allows us to connect to the database and execute SQL queries seamlessly. Below is a brief example:
# Load the RMariaDB library library(RMariaDB) # Database connection details db_host <- '192.168.1.9999' # Raspberry Pi's IP address db_user <- 'exampleuser' db_password <- 'password' db_name <- 'SensorDataDB' # Create a connection object conn <- dbConnect(MariaDB(), user=db_user, password=db_password, dbname=db_name, host=db_host) # SQL query to execute query <- "SELECT sm.* FROM sensorMeasurements sm INNER JOIN sensorData sd ON sm.sensorOI = sd.sensorOI LEFT JOIN plantData p ON sm.sensorOI = p.sensorOI order by timestamp desc, sensoroi;" # Execute the query and fetch the results df <- dbGetQuery(conn, query) # Print the result to the console print(df)
EDA on Sensor Data with R and Python
With our data now extracted from the Raspberry Pi into the df table, we can proceed with exploratory data analysis to derive some insights.
EDA with Python
Python offers a plethora of libraries for data analysis, with Pandas and Seaborn being the front runners for EDA. Let's start by loading our data and taking a peek at the summary statistics:
import pandas as pd import seaborn as sns import matplotlib.pyplot as plt # Convert timestamp to datetime for easier manipulation df['timestamp'] = pd.to_datetime(df['timestamp']) # Summary statistics for each sensor print(df.groupby('sensorOI')['sensorRead'].describe()) # Visualize the distribution of readings for each sensor sns.set(style="whitegrid") plt.figure(figsize=(12, 6)) sns.boxplot(x='sensorOI', y='sensorRead', data=df) plt.title('Distribution of Sensor Readings') plt.show()
This code block loads our plant sensor data, computes summary statistics, and visualizes the moisture levels' distribution, providing us with a first look at our data's nature and shape.
EDA with R
R, revered for its statistical prowess, makes EDA a breeze with its dplyr
and ggplot2
packages. Here's how we can perform a similar analysis in R:
library(dplyr) library(ggplot2) # Convert timestamp to POSIXct for easier manipulation df$timestamp <- as.POSIXct(data$timestamp, format="%Y-%m-%d %H:%M:%S") df %>% group_by(sensorOI) %>% summarise( mean = mean(sensorRead), sd = sd(sensorRead), median = median(sensorRead), min = min(sensorRead), max = max(sensorRead), Q1 = quantile(sensorRead, 0.25), Q3 = quantile(sensorRead, 0.75), n = n() ) # Visualize the distribution of readings for each sensor ggplot(df, aes(x=factor(sensorOI), y=sensorRead)) + geom_boxplot() + labs(title = "Distribution of Sensor Readings", x = "Sensor ID", y = "Sensor Reading") + theme_minimal()