15. Data Cleaning with Pandas
Common Cleaning Tasks
Data for AI pipelines usually needs cleaning before it can be used. The most common tasks are handling missing values, removing duplicates, converting types, and dealing with outliers.
Missing Values
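import pandas as pd

# Create sample data with missing values
data = pd.DataFrame({
    "text": ["Hello", "World", None, "AI", None],
    "score": [0.8, None, 0.5, 0.9, 0.7],
    "category": ["A", "B", "A", None, "B"],
})

# Count missing values per column
print(data.isna().sum())

# Drop rows with a missing value in any column
clean = data.dropna()

# Drop rows only where specific columns are missing
clean = data.dropna(subset=["text"])

# Fill missing values. Assign the result back to the column: calling
# fillna(inplace=True) on a selected column is chained assignment, which
# recent pandas versions warn about and which may leave `data` unchanged.
data["score"] = data["score"].fillna(data["score"].median())
data["category"] = data["category"].fillna("unknown")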
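The strategies above differ a lot in how many rows survive. A minimal sketch that compares them on the same sample data; the variable names (`rows_complete`, `rows_with_text`, `filled`) are illustrative, not part of pandas:

```python
import pandas as pd

data = pd.DataFrame({
    "text": ["Hello", "World", None, "AI", None],
    "score": [0.8, None, 0.5, 0.9, 0.7],
    "category": ["A", "B", "A", None, "B"],
})

# Rows that survive each dropping strategy
rows_complete = len(data.dropna())                    # no missing value in any column
rows_with_text = len(data.dropna(subset=["text"]))    # only "text" must be present

# Filling keeps every row; assign() returns a new DataFrame
filled = data.assign(
    score=data["score"].fillna(data["score"].median()),
    category=data["category"].fillna("unknown"),
)

print(rows_complete, rows_with_text, len(filled))
```

On this sample, dropping every incomplete row leaves a single row, while restricting the check to "text" keeps three, so it is usually worth deciding per column whether a missing value should be dropped or filled.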
Removing Duplicates
data = pd.DataFrame({
    "text": ["Hello", "Hello", "World", "AI"],
    "id": [1, 1, 2, 3],
})
# duplicated() marks a row True when an identical row appeared earlier
print(data.duplicated())  # False, True, False, False
clean = data.drop_duplicates()
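By default `drop_duplicates()` compares entire rows. When only some columns define a duplicate, the `subset` and `keep` parameters control the comparison. A small sketch, assuming a hypothetical `records` frame in which the same text appears under two different ids:

```python
import pandas as pd

records = pd.DataFrame({
    "text": ["Hello", "Hello", "World", "AI"],
    "id": [1, 4, 2, 3],
})

# Whole-row comparison removes nothing here, because the ids differ
exact = records.drop_duplicates()

# Compare on "text" only, keeping the first or the last occurrence
first = records.drop_duplicates(subset=["text"], keep="first")
last = records.drop_duplicates(subset=["text"], keep="last")

print(len(exact), first["id"].tolist(), last["id"].tolist())
```

Which occurrence to keep depends on the data: `keep="last"` is a reasonable choice when later rows are newer, for example.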
Type Conversion
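# Sample data: scores arrive as strings, one of them unparseable
data = pd.DataFrame({"text": ["Hello", "World", "AI"], "score": ["0.8", "n/a", "0.9"]})
# Check types
print(data.dtypes)
# Convert; errors="coerce" turns values that cannot be parsed into NaN
data["score"] = pd.to_numeric(data["score"], errors="coerce")
data["text"] = data["text"].astype(str)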
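Because `errors="coerce"` silently replaces bad values with NaN, it can help to check which entries failed to parse before moving on. A hedged sketch, assuming a hypothetical `raw` frame; it also uses the `"string"` dtype, which keeps missing text as missing values:

```python
import pandas as pd

raw = pd.DataFrame({
    "score": ["0.8", "n/a", None, "0.9"],
    "text": ["Hello", "World", None, "AI"],
})

score_num = pd.to_numeric(raw["score"], errors="coerce")

# A value failed to parse if it was present before and is NaN afterwards
failed = raw["score"].notna() & score_num.isna()
print(raw.loc[failed, "score"].tolist())

converted = raw.assign(
    score=score_num,
    text=raw["text"].astype("string"),  # missing text stays missing
)
print(converted.dtypes)
```

Inspecting the failed values first makes it easier to tell a placeholder such as "n/a" apart from a genuine formatting problem in the source data.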
String Cleaning
df = pd.DataFrame({"text": [" Hello ", "WORLD!", "ai..."]})
df["text"] = df["text"].str.lower()
df["text"] = df["text"].str.strip()
df["text"] = df["text"].str.replace(r"[^\w\s]", "", regex=True)
print(df["text"]) # ["hello", "world", "ai"]
Chaining Operations
clean_df = (
    df.dropna(subset=["text"])
    .drop_duplicates()
    .assign(
        text=lambda x: x["text"].str.lower().str.strip(),
        length=lambda x: x["text"].str.len(),
    )
)
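Outliers, the fourth task named at the start of the chapter, can be handled in the same style. The chapter does not prescribe a method, so the sketch below uses the common interquartile range (IQR) rule as an assumption; the `scores` data and the 1.5 multiplier are illustrative choices:

```python
import pandas as pd

scores = pd.DataFrame({"score": [0.8, 0.75, 0.5, 0.9, 0.7, 12.0]})

# Interquartile range of the column
q1 = scores["score"].quantile(0.25)
q3 = scores["score"].quantile(0.75)
iqr = q3 - q1

# Values further than 1.5 * IQR outside the quartiles count as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = ~scores["score"].between(lower, upper)

clean_scores = scores[~is_outlier]
print(scores.loc[is_outlier, "score"].tolist())
```

Whether to drop, cap, or keep flagged values depends on the data; a score of 12.0 on a 0 to 1 scale is probably an entry error, but a rare legitimate value may deserve to stay.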
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
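One way to capture the version and runtime details is a short script run alongside the example. This is a sketch; the keys of the `env` dictionary are arbitrary labels:

```python
import platform
import sys

import pandas as pd

# Collect the details worth recording next to the observed output
env = {
    "pandas": pd.__version__,
    "python": sys.version.split()[0],
    "platform": platform.platform(),
}

for key, value in env.items():
    print(f"{key}: {value}")
```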
Exercise
Given a DataFrame with columns ["name", "email", "score", "category"] that contains missing values, duplicates, and mixed-case strings, clean it: drop rows with missing emails, remove duplicate names, and normalize the text columns.
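One possible solution sketch follows. The sample rows and the helper name `clean_people` are made up for illustration, and normalizing here is assumed to mean stripping whitespace and lowercasing:

```python
import pandas as pd

people = pd.DataFrame({
    "name": [" Ada ", "ada", "Grace", "Linus", "Alan"],
    "email": ["ADA@Example.com", "ada@example.com", None,
              " linus@example.com", "alan@example.com"],
    "score": [0.9, 0.9, 0.7, None, 0.4],
    "category": ["A", "a", "B", "b ", None],
})


def clean_people(df):
    """Drop rows without an email, normalize text, then deduplicate by name."""
    out = df.dropna(subset=["email"]).copy()
    for col in ["name", "email", "category"]:
        # .str methods leave missing values as missing
        out[col] = out[col].str.strip().str.lower()
    return out.drop_duplicates(subset=["name"]).reset_index(drop=True)


cleaned = clean_people(people)
print(cleaned)
```

Normalizing before deduplicating is a deliberate choice in this sketch: " Ada " and "ada" only compare equal once whitespace and case have been removed.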