Exploratory Data Analysis: Using R and Python

Early in my data science journey, I was inspired by a post which created a roadmap of things I could learn to become proficient. Below I will attempt to follow a similar exploratory path using R and Python.

Before diving into complex models, the first step in our plant data science odyssey is exploratory data analysis (EDA), a crucial process that allows us to understand the basic structure of our plant data, identify anomalies, and test hypotheses at a glance. Here, we'll use R and Python to scrutinize our Raspberry Pi sensor data, setting the stage for deeper analysis.

Before we begin: account connection setup

The MariaDB database on the Raspberry Pi will need a user setup with the IP address of the PC from which you are doing the analysis. With your IP address in available to you, use terminal to connect to the Raspberry Pi (as we have previously) then connect to the database (as we have previously). Once connected to the database, enter the following command:

CREATE USER 'exampleuser'@'192.168.1.9999' IDENTIFIED BY 'password';

Note, you will need to replace the example IP address listed of 192.168.1.9999 to your PC IP address. The above creates the user ‘exampleuser’ at your IP address with the password of ‘password’. You should be more thoughtful and think of something in line with your project. Enter the following command to give the user privaleges to access and update the database:

GRANT ALL PRIVILEGES ON SensorDataDB.* TO 'exampleuser'@'192.168.1.9999';

FLUSH PRIVILEGES;

You can check to confirm that the correct username and IP address have been added to the Host table by entering the following command:

SELECT User, Host 
FROM mysql.user

If your username and IP address correctly appear, you can now remotely query data from the Raspberry Pi to your PC.

retrieving sensor data with R and python

R and Python follow similar patterns to connect and retrieve data from the Raspberry Pi database. With your credentials, a query, and a variable we can pull and store all the data we need to complete our analysis.

retrieving sensor data with python

For our plant monitoring project, timely and efficient data retrieval is just as crucial as the subsequent analysis. In this section, we'll cover how to extract the sensor data stored in our MariaDB database using Python. We'll establish a database connection and execute a SQL query to fetch the data we need using libraries Python provides for interacting with databases. For MariaDB, we can use the mysql.connector library, which allows us to connect to the database and execute SQL queries seamlessly. Below is a brief example:

import mysql.connector
import pandas as pd

# Database connection details
db_host = '192.168.1.188' # Raspberry Pi's IP address
db_user = 'Orion'
db_password = 'GardenOfEden'
db_name = 'SensorDataDB'

# Establish a database connection
conn = mysql.connector.connect(
    host=db_host,
    user=db_user,
    passwd=db_password,
    database=db_name
)

# SQL query to execute
query = """
SELECT sm.*
FROM sensorMeasurements sm
INNER JOIN sensorData sd ON sm.sensorOI = sd.sensorOI
LEFT JOIN plantData p ON sm.sensorOI = p.sensorOI
ORDER BY timestamp DESC, sensorOI;
"""

# Execute the query and fetch the results into a DataFrame
df = pd.read_sql_query(query, conn)

# Close the database connection
conn.close()

# Display the first few rows of the DataFrame
print(df.head())

retrieving sensor data with R

Similar to Python, we'll establish a database connection and execute a SQL query to fetch the data we need using several libraries R provides for interacting with databases. For MariaDB, we can use the RMariaDB library, which allows us to connect to the database and execute SQL queries seamlessly. Below is a brief example:

# Load the RMariaDB library
library(RMariaDB)

# Database connection details
db_host <- '192.168.1.9999' # Raspberry Pi's IP address
db_user <- 'exampleuser'
db_password <- 'password'
db_name <- 'SensorDataDB'

# Create a connection object
conn <- dbConnect(MariaDB(), user=db_user, password=db_password, dbname=db_name, host=db_host)

# SQL query to execute
query <- "SELECT sm.*
FROM sensorMeasurements sm
INNER JOIN sensorData sd ON sm.sensorOI = sd.sensorOI
LEFT JOIN plantData p ON sm.sensorOI = p.sensorOI
order by timestamp desc, sensoroi;"

# Execute the query and fetch the results
df <- dbGetQuery(conn, query)

# Print the result to the console
print(df)

EDA on Sensor Data with R and Python

With our data now extracted from the Raspberry Pi into the df table, we can proceed with exploratory data analysis to derive some insights.

EDA with Python

Python offers a plethora of libraries for data analysis, with Pandas and Seaborn being the front runners for EDA. Let's start by loading our data and taking a peek at the summary statistics:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Convert timestamp to datetime for easier manipulation
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Summary statistics for each sensor
print(df.groupby('sensorOI')['sensorRead'].describe())

# Visualize the distribution of readings for each sensor
sns.set(style="whitegrid")
plt.figure(figsize=(12, 6))
sns.boxplot(x='sensorOI', y='sensorRead', data=df)
plt.title('Distribution of Sensor Readings')
plt.show()

This code block loads our plant sensor data, computes summary statistics, and visualizes the moisture levels' distribution, providing us with a first look at our data's nature and shape.

EDA with R

R, revered for its statistical prowess, makes EDA a breeze with its dplyr and ggplot2 packages. Here's how we can perform a similar analysis in R:

library(dplyr)
library(ggplot2)

# Convert timestamp to POSIXct for easier manipulation
df$timestamp <- as.POSIXct(data$timestamp, format="%Y-%m-%d %H:%M:%S")

df %>%
  group_by(sensorOI) %>%
  summarise(
    mean = mean(sensorRead),
    sd = sd(sensorRead),
    median = median(sensorRead),
    min = min(sensorRead),
    max = max(sensorRead),
    Q1 = quantile(sensorRead, 0.25),
    Q3 = quantile(sensorRead, 0.75),
    n = n()
  )

# Visualize the distribution of readings for each sensor
ggplot(df, aes(x=factor(sensorOI), y=sensorRead)) +
  geom_boxplot() +
  labs(title = "Distribution of Sensor Readings", x = "Sensor ID", y = "Sensor Reading") +
  theme_minimal()
Next
Next

Automation: Watering the Plants