15. Data Cleaning with Pandas

Chapter 15 of 36 · 20 min

Common Cleaning Tasks

Data feeding an AI pipeline usually needs cleaning first. The most common tasks are handling missing values, removing duplicates, converting types, and dealing with outliers.
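
Outliers are named in the list above but not demonstrated elsewhere in this chapter. One common way to flag them is the interquartile range (IQR) rule. The sketch below is an illustration under assumptions, not the chapter's prescribed method: the scores are invented and the 1.5 multiplier is a convention you may want to tune.

```python
import pandas as pd

# Hypothetical scores; 9.5 sits far outside the range of the others.
scores = pd.Series([0.8, 0.5, 0.9, 0.7, 9.5], name="score")

# IQR fences: anything more than 1.5 * IQR beyond the quartiles is flagged.
q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = ~scores.between(lower, upper)
print(scores[is_outlier])  # only the 9.5 row is flagged
```

Whether a flagged row should be dropped, capped, or kept depends on the data; the mask only tells you where to look.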

Missing Values

import pandas as pd

# Create sample data with missing values
data = pd.DataFrame({
    "text": ["Hello", "World", None, "AI", None],
    "score": [0.8, None, 0.5, 0.9, 0.7],
    "category": ["A", "B", "A", None, "B"]
})

# Count missing values per column
print(data.isna().sum())

# Drop rows with any missing
clean = data.dropna()

# Drop rows missing specific columns
clean = data.dropna(subset=["text"])

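# Fill missing values. Assign the result back: calling fillna(inplace=True)
# on a selected column triggers a chained-assignment warning in recent pandas
# and does not update `data` under Copy-on-Write.
data["score"] = data["score"].fillna(data["score"].median())
data["category"] = data["category"].fillna("unknown")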
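
When several columns need different fill values, `DataFrame.fillna` also accepts a dict mapping column names to values. A minimal sketch using the same sample data as above; the name `filled` is only for illustration:

```python
import pandas as pd

data = pd.DataFrame({
    "text": ["Hello", "World", None, "AI", None],
    "score": [0.8, None, 0.5, 0.9, 0.7],
    "category": ["A", "B", "A", None, "B"],
})

# Fill several columns in one call; columns not listed are left untouched.
filled = data.fillna({
    "score": data["score"].median(),
    "category": "unknown",
})

print(filled.isna().sum())  # "text" still has its 2 missing values
```

This returns a new DataFrame and leaves `data` unchanged, which makes it easy to compare before and after.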

Removing Duplicates

data = pd.DataFrame({
    "text": ["Hello", "Hello", "World", "AI"],
    "id": [1, 1, 2, 3]
})

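# duplicated() marks every repeat of an earlier row
print(data.duplicated().tolist())  # [False, True, False, False]

# Keep the first occurrence of each row
clean = data.drop_duplicates()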
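
By default, rows count as duplicates only when every column matches. The `subset` and `keep` parameters change that. The sketch below uses invented rows in which the same id appears with different casing:

```python
import pandas as pd

data = pd.DataFrame({
    "text": ["Hello", "hello", "World", "AI"],
    "id": [1, 1, 2, 3],
})

# Full-row comparison treats "Hello" and "hello" as different rows.
print(data.duplicated().sum())  # 0

# Compare on "id" only and keep the last occurrence of each id.
by_id = data.drop_duplicates(subset=["id"], keep="last")
print(by_id)
```

Choosing the key columns explicitly is usually safer than relying on full-row equality, since small formatting differences can hide real duplicates.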

Type Conversion

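# Sample data: scores stored as strings, one of them unparseable
data = pd.DataFrame({"text": ["Hello", "World", "AI"], "score": ["0.8", "oops", "0.9"]})

# Check types
print(data.dtypes)

# Convert; errors="coerce" turns unparseable values into NaN
data["score"] = pd.to_numeric(data["score"], errors="coerce")
# astype(str) can turn missing values into literal strings such as "None" or "nan", so handle those first
data["text"] = data["text"].astype(str)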
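
Because `errors="coerce"` silently replaces bad values with NaN, it helps to check which entries failed to parse. A small sketch with made-up values; `raw`, `scores`, and `failed` are illustrative names:

```python
import pandas as pd

raw = pd.Series(["0.8", "n/a", "0.5", "high"], name="score")

# Unparseable strings become NaN instead of raising an error.
scores = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the input but did not survive conversion.
failed = raw[scores.isna() & raw.notna()]
print(failed.tolist())  # ['n/a', 'high']
```

Inspecting `failed` before moving on shows whether the bad values are typos worth fixing or placeholders that really mean "missing".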

String Cleaning

df = pd.DataFrame({"text": ["  Hello  ", "WORLD!", "ai..."]})

df["text"] = df["text"].str.lower()                               # lowercase
df["text"] = df["text"].str.strip()                               # trim surrounding whitespace
df["text"] = df["text"].str.replace(r"[^\w\s]", "", regex=True)  # drop punctuation

print(df["text"].tolist())  # ['hello', 'world', 'ai']
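
The same steps can be wrapped in a reusable function. The sketch below adds one step that the chapter's example does not need, collapsing repeated internal whitespace. `normalize_text` is a hypothetical helper name, not a pandas function:

```python
import pandas as pd

def normalize_text(s: pd.Series) -> pd.Series:
    """Lowercase, remove punctuation, and collapse runs of whitespace."""
    return (
        s.str.lower()
        .str.replace(r"[^\w\s]", "", regex=True)  # drop punctuation
        .str.replace(r"\s+", " ", regex=True)     # collapse whitespace runs
        .str.strip()                              # trim the ends
    )

raw = pd.Series(["  Hello,   World!  ", None, "AI..."])
cleaned = normalize_text(raw)
print(cleaned.iloc[0])  # hello world
```

The `.str` methods pass missing values through unchanged, so the `None` entry stays missing rather than becoming a string.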

Chaining Operations

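# Each method returns a new DataFrame, so the steps can be chained
clean_df = (
    df.dropna(subset=["text"])   # drop rows with missing text
    .drop_duplicates()           # drop exact duplicate rows
    .assign(
        text=lambda x: x["text"].str.lower().str.strip(),
        # assign evaluates keyword arguments in order, so this sees the cleaned text
        length=lambda x: x["text"].str.len()
    )
)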
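
In the chain above, `drop_duplicates` runs before the text is normalized, so rows that differ only in case or whitespace both survive. If those should count as duplicates, one option is to normalize first. A sketch with invented rows:

```python
import pandas as pd

df = pd.DataFrame({"text": ["  Hello  ", "hello", None, "WORLD"]})

clean_df = (
    df.dropna(subset=["text"])
    .assign(text=lambda x: x["text"].str.lower().str.strip())  # normalize first
    .drop_duplicates(subset=["text"])                          # then deduplicate
    .assign(length=lambda x: x["text"].str.len())
    .reset_index(drop=True)
)

print(clean_df["text"].tolist())  # ['hello', 'world']
```

Which order is right depends on whether "Hello" and "hello" mean the same thing in your data.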

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Given a DataFrame with columns ["name", "email", "score", "category"] containing missing values, duplicates, and mixed-case strings, clean it: drop rows with missing emails, remove rows with duplicate names, and normalize the text columns (for example, lowercase them and strip surrounding whitespace).
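
If you do not have a dataset at hand, the sketch below offers a starting point. The rows are invented for practice, and `check_cleaned` is a hypothetical helper that assumes "normalize" means lowercase with surrounding whitespace stripped. It checks your result without giving the solution away.

```python
import pandas as pd

# Invented practice rows; not data from the chapter.
raw = pd.DataFrame({
    "name": ["Ada", "ada ", "Grace", "Linus", None],
    "email": [
        "ada@example.com",
        "ADA@example.com",
        None,
        " linus@example.com",
        "x@example.com",
    ],
    "score": [0.9, 0.9, 0.7, None, 0.4],
    "category": ["A", "a", "B", None, "b"],
})

def check_cleaned(df: pd.DataFrame) -> None:
    """Raise AssertionError if df misses one of the exercise's goals."""
    assert df["email"].notna().all(), "missing emails remain"
    assert not df["name"].duplicated().any(), "duplicate names remain"
    for col in ["name", "email", "category"]:
        values = df[col].dropna()
        normalized = values.str.lower().str.strip()
        assert (values == normalized).all(), f"{col} is not normalized"
```

Call `check_cleaned(your_result)` after cleaning `raw`; it returns silently when all three goals are met.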