python-data-reviewer


WHEN: Pandas/NumPy code review, data processing, vectorization, memory optimization
WHAT: Vectorization patterns + Memory efficiency + Data validation + Performance optimization + Best practices
WHEN NOT: Web framework → fastapi/django/flask-reviewer, General Python → python-reviewer

1 stars
1.2k downloads
Updated 12/29/2025

SKILL.md

Python Data Reviewer Skill

Purpose

Reviews data science code for Pandas/NumPy efficiency, memory usage, and best practices.

When to Use

  • Pandas/NumPy code review
  • Data processing optimization
  • Memory efficiency check
  • "Why is my data code slow?"
  • ETL pipeline review

Project Detection

  • pandas, numpy in requirements.txt
  • .ipynb Jupyter notebooks
  • data/, notebooks/ directories
  • DataFrame operations in code
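
The detection signals above could be checked mechanically. The following is a minimal sketch, not the skill's actual implementation: the function name `detect_data_project` and the returned dictionary layout are assumptions made for illustration, and the "DataFrame operations in code" signal is left out because it needs source parsing.

```python
from pathlib import Path

DATA_PACKAGES = ("pandas", "numpy")   # packages named in the list above
DATA_DIRS = ("data", "notebooks")     # directories named in the list above


def detect_data_project(root: str) -> dict:
    """Collect simple signals that a directory holds Pandas/NumPy code."""
    base = Path(root)
    signals = {"packages": [], "notebooks": 0, "dirs": []}

    req = base / "requirements.txt"
    if req.is_file():
        for line in req.read_text(encoding="utf-8").splitlines():
            # Drop trailing comments, then version specifiers such as "pandas>=2.0"
            name = line.split("#")[0].strip().lower()
            for sep in ("==", ">=", "<=", "~=", ">", "<", "["):
                name = name.split(sep)[0].strip()
            if name in DATA_PACKAGES:
                signals["packages"].append(name)

    # Count Jupyter notebooks anywhere under the project root
    signals["notebooks"] = sum(1 for _ in base.rglob("*.ipynb"))
    signals["dirs"] = [d for d in DATA_DIRS if (base / d).is_dir()]
    return signals
```

A project would plausibly be treated as a data project when any of the three signals is non-empty; the exact threshold is a judgment call.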

Workflow

Step 1: Analyze Project

**Pandas**: 2.0+
**NumPy**: 1.24+
**Other**: polars, dask, vaex
**Visualization**: matplotlib, seaborn, plotly
**ML**: scikit-learn, xgboost

Step 2: Select Review Areas

AskUserQuestion:

"Which areas to review?"
Options:
- Full data code review (recommended)
- Vectorization and performance
- Memory optimization
- Data validation
- Code organization
multiSelect: true

Detection Rules

Vectorization

| Check | Recommendation | Severity |
|-------|----------------|----------|
| iterrows() loop | Use vectorized operations | CRITICAL |
| apply() with simple func | Use built-in vectorized | HIGH |
| Manual loop over array | Use NumPy broadcasting | HIGH |
| List comprehension on Series | Use .map() or vectorize | MEDIUM |

# BAD: iterrows (extremely slow)
for idx, row in df.iterrows():
    df.loc[idx, "total"] = row["price"] * row["quantity"]

# GOOD: Vectorized operation
df["total"] = df["price"] * df["quantity"]

# BAD: apply with simple operation
df["upper_name"] = df["name"].apply(lambda x: x.upper())

# GOOD: Built-in string method
df["upper_name"] = df["name"].str.upper()

# BAD: apply with condition
df["status"] = df["score"].apply(lambda x: "pass" if x >= 60 else "fail")

# GOOD: np.where or np.select
df["status"] = np.where(df["score"] >= 60, "pass", "fail")

# Multiple conditions
conditions = [
    df["score"] >= 90,
    df["score"] >= 60,
    df["score"] < 60,
]
choices = ["A", "B", "F"]
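# Explicit string default: NaN scores match no condition, and the implicit
# default of 0 does not mix with string choices
df["grade"] = np.select(conditions, choices, default="unknown")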
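
The two highest-severity vectorization checks lend themselves to static detection. The sketch below uses the standard-library `ast` module; the function name `find_slow_calls` and the tuple layout are hypothetical, and flagging only `apply` calls whose first argument is a lambda is a simplifying assumption about what counts as a "simple func".

```python
import ast


def find_slow_calls(source: str) -> list[tuple[int, str, str]]:
    """Return (line, pattern, severity) for iterrows() and apply(lambda ...) calls."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        # Only method-style calls such as obj.iterrows() are of interest
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)):
            continue
        method = node.func.attr
        if method == "iterrows":
            findings.append((node.lineno, "iterrows", "CRITICAL"))
        elif method == "apply" and node.args and isinstance(node.args[0], ast.Lambda):
            findings.append((node.lineno, "apply(lambda)", "HIGH"))
    return sorted(findings)


sample = (
    "for idx, row in df.iterrows():\n"
    "    pass\n"
    "df['u'] = df['name'].apply(lambda x: x.upper())\n"
)
for line, pattern, severity in find_slow_calls(sample):
    print(f"{severity}: line {line}: {pattern}")
```

A name-based check like this cannot tell a DataFrame from any other object with an `iterrows` or `apply` method, so its hits are candidates for review rather than confirmed issues.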

Memory Optimization

| Check | Recommendation | Severity |
|-------|----------------|----------|
| int64 for small ints | Use int8/int16/int32 | MEDIUM |
| object dtype for categories | Use category dtype | HIGH |
| Loading full file | Use chunks or usecols | HIGH |
| Keeping unused columns | Drop early | MEDIUM |

# BAD: Default dtypes waste memory
df = pd.read_csv("large_file.csv")  # All int64, object

# GOOD: Specify dtypes
dtype_map = {
    "id": "int32",
    "age": "int8",
    "status": "category",
    "price": "float32",
}
df = pd.read_csv("large_file.csv", dtype=dtype_map)

# GOOD: Load only needed columns
df = pd.read_csv(
    "large_file.csv",
    usecols=["id", "name", "price"],
    dtype=dtype_map,
)

# GOOD: Process in chunks
chunks = pd.read_csv("huge_file.csv", chunksize=100_000)
result = pd.concat([process_chunk(chunk) for chunk in chunks])

# Memory check (info() prints its report and returns None, so no print() is needed)
df.info(memory_usage="deep")

# Convert object to category if low cardinality
for col in df.select_dtypes(include=["object"]).columns:
    if df[col].nunique() / len(df) < 0.5:  # < 50% unique
        df[col] = df[col].astype("category")
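
To see what the dtype recommendations save on an already-loaded frame, a before/after measurement helps. This is a rough sketch under stated assumptions: the helper name `shrink`, the 0.5 cardinality threshold (taken from the snippet above), and the synthetic data are all illustrative. Downcasting floats to float32 loses precision, so that branch should only be used where the loss is acceptable.

```python
import numpy as np
import pandas as pd


def shrink(df: pd.DataFrame, max_unique_ratio: float = 0.5) -> pd.DataFrame:
    """Return a copy with downcast numerics and low-cardinality categories."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Picks the smallest signed integer type that holds every value
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
        elif out[col].dtype == object and out[col].nunique() / len(out) < max_unique_ratio:
            out[col] = out[col].astype("category")
    return out


# Synthetic example data (not from the original document)
df = pd.DataFrame({
    "id": np.arange(10_000, dtype="int64"),
    "age": np.tile(np.arange(18, 68, dtype="int64"), 200),
    "status": pd.Series(["active", "inactive"] * 5_000, dtype="object"),
})
before = df.memory_usage(deep=True).sum()
after = shrink(df).memory_usage(deep=True).sum()
print(f"{before:,} bytes -> {after:,} bytes")
```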

Data Validation

| Check | Recommendation | Severity |
|-------|----------------|----------|
| No null check | Validate nulls early | HIGH |
| No dtype validation | Assert expected types | MEDIUM |
| No range validation | Check value bounds | MEDIUM |
| Silent data issues | Raise or log warnings | HIGH |

# GOOD: Data validation function
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def validate_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Validate and clean input DataFrame."""

    # Required columns
    required_cols = ["id", "name", "price", "quantity"]
    missing = set(required_cols) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {missing}")

    # Null checks
    null_counts = df[required_cols].isnull().sum()
    if null_counts.any():
        logger.warning(f"Null values found:\n{null_counts[null_counts > 0]}")

    # Type validation (explicit raises, because assert is stripped under python -O)
    if not pd.api.types.is_integer_dtype(df["id"]):
        raise TypeError("id must be integer")
    if not pd.api.types.is_float_dtype(df["price"]):
        raise TypeError("price must be float")

    # Range validation
    invalid_prices = df[df["price"] < 0]
    if len(invalid_prices) > 0:
        logger.warning(f"Found {len(invalid_prices)} negative prices")
        df = df[df["price"] >= 0]

    # Duplicate check
    duplicates = df.duplicated(subset=["id"])
    if duplicates.any():
        logger.warning(f"Removing {duplicates.sum()} duplicates")
        df = df.drop_duplicates(subset=["id"])

    return df
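
For a review report it can be more useful to list every problem than to stop at the first one. The following sketch shows one way to do that; the `SCHEMA` mapping and the `collect_issues` name are hypothetical and not part of the skill.

```python
import pandas as pd

# Hypothetical schema: column name -> (dtype check, nulls allowed)
SCHEMA = {
    "id": (pd.api.types.is_integer_dtype, False),
    "price": (pd.api.types.is_float_dtype, False),
    "quantity": (pd.api.types.is_integer_dtype, False),
}


def collect_issues(df: pd.DataFrame, schema: dict) -> list[str]:
    """Return a human-readable list of schema violations without raising."""
    issues = []
    for col, (dtype_check, nullable) in schema.items():
        if col not in df.columns:
            issues.append(f"{col}: missing column")
            continue
        if not dtype_check(df[col]):
            issues.append(f"{col}: unexpected dtype {df[col].dtype}")
        nulls = int(df[col].isna().sum())
        if nulls and not nullable:
            issues.append(f"{col}: {nulls} null value(s)")
    return issues


sample = pd.DataFrame({"id": [1, 2, 3], "price": [9.5, None, 3.0]})
for issue in collect_issues(sample, SCHEMA):
    print(issue)
```

The caller can then decide whether to raise, log, or attach the list to the review output.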

NumPy Patterns

| Check | Recommendation | Severity |
|-------|----------------|----------|
| Python loop over array | Use broadcasting | CRITICAL |
| np.append in loop | Pre-allocate array | HIGH |
| Repeated array creation | Reuse arrays | MEDIUM |
| Copy when not needed | Use views | MEDIUM |

# BAD: Python loop
result = []
for i in range(len(arr)):
    result.append(arr[i] * 2 + 1)
result = np.array(result)

# GOOD: Vectorized
result = arr * 2 + 1

# BAD: np.append in loop (creates new array each time)
result = np.array([])
for x in data:
    result = np.append(result, process(x))

# GOOD: Pre-allocate
result = np.empty(len(data))
for i, x in enumerate(data):
    result[i] = process(x)

# CONVENIENCE ONLY: np.vectorize is still a Python-level loop, so it
# tidies the call site without a real speedup
result = np.vectorize(process)(data)

# BEST: Pure NumPy operation
result = np.sqrt(data) + np.log(data)

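# BAD: Redundant copy (boolean-mask indexing already returns a new array)
subset = arr[arr > 0].copy()

# GOOD: Mask without the extra copy; use basic slicing when a view is wanted
subset = arr[arr > 0]  # Already a copy, independent of arr
window = arr[10:20]    # Basic slice: a view, no data copied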
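
Whether an indexing expression yields a view or a copy can be checked directly with `np.shares_memory`. The short demonstration below uses a made-up array: basic slicing shares memory with the source, while boolean-mask and integer-array indexing allocate new arrays.

```python
import numpy as np

arr = np.arange(10)

sliced = arr[2:5]        # basic slicing
masked = arr[arr > 5]    # boolean-mask indexing
fancy = arr[[0, 1]]      # integer-array ("fancy") indexing

print(np.shares_memory(arr, sliced))  # True: a view
print(np.shares_memory(arr, masked))  # False: a copy
print(np.shares_memory(arr, fancy))   # False: a copy

sliced[0] = 99   # writes through to arr[2]
masked[0] = -1   # leaves arr untouched
print(arr[2], arr[6])
```

The practical consequence for review: writing into a slice mutates the source array, while `.copy()` after a mask is redundant.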

Pandas Best Practices

| Check | Recommendation | Severity |
|-------|----------------|----------|
| Chained indexing | Use .loc/.iloc | HIGH |
| inplace=True | Assign result instead | MEDIUM |
| reset_index() abuse | Keep index when useful | LOW |
| df.append (deprecated) | Use pd.concat | HIGH |

# BAD: Chained indexing (unpredictable)
df[df["price"] > 100]["status"] = "premium"  # May not work!

# GOOD: Use .loc
df.loc[df["price"] > 100, "status"] = "premium"

# BAD: df.append (deprecated in pandas 1.4, removed in 2.0)
result = pd.DataFrame()
for chunk in chunks:
    result = result.append(chunk)

# GOOD: pd.concat
result = pd.concat(chunks, ignore_index=True)

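# VERBOSE: Repeated reassignment (index is reset before sorting, so it ends up out of order)
df = df.dropna()
df = df.reset_index(drop=True)
df = df.sort_values("date")

# GOOD: Method chaining (same intermediate copies, but one readable pipeline; reset the index last)
df = (
    df
    .dropna()
    .sort_values("date")
    .reset_index(drop=True)
)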
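
The `pd.concat` and `.loc` recommendations combine naturally when frames are built up in a loop. The sketch below uses invented batch data: pieces are collected in a list, concatenated once, and then updated with a single `.loc` assignment.

```python
import pandas as pd

frames = []
for batch in range(3):  # stand-in for reading chunks or files
    frames.append(pd.DataFrame({"batch": batch, "price": [50.0, 150.0]}))

# One concat at the end instead of growing a DataFrame inside the loop
df = pd.concat(frames, ignore_index=True)

# One .loc call selects the rows and the column together, so the
# assignment is applied to df itself rather than to a temporary object
df["status"] = "standard"
df.loc[df["price"] > 100, "status"] = "premium"
print(df)
```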

Response Template

## Python Data Code Review Results

**Project**: [name]
**Pandas**: 2.1 | **NumPy**: 1.26 | **Size**: ~1M rows

### Vectorization
| Status | File | Issue |
|--------|------|-------|
| CRITICAL | process.py:45 | iterrows() loop - use vectorized ops |

### Memory
| Status | File | Issue |
|--------|------|-------|
| HIGH | load.py:23 | object dtype for 'status' - use category |

### Data Validation
| Status | File | Issue |
|--------|------|-------|
| HIGH | etl.py:67 | No null value handling |

### NumPy
| Status | File | Issue |
|--------|------|-------|
| HIGH | calc.py:12 | np.append in loop - pre-allocate |

### Recommended Actions
1. [ ] Replace iterrows with vectorized operations
2. [ ] Convert low-cardinality columns to category
3. [ ] Add data validation before processing
4. [ ] Pre-allocate NumPy arrays in loops
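
If findings are gathered programmatically, the per-section tables of the template can be generated rather than written by hand. This is an illustrative sketch only; `render_findings` and the four-field tuple layout are assumptions, not part of the skill.

```python
from collections import defaultdict


def render_findings(findings: list[tuple[str, str, str, str]]) -> str:
    """Render (section, severity, location, issue) tuples as markdown tables."""
    by_section = defaultdict(list)
    for section, severity, location, issue in findings:
        by_section[section].append((severity, location, issue))

    lines = []
    for section, rows in by_section.items():  # keeps first-seen section order
        lines += [f"### {section}", "| Status | File | Issue |", "|--------|------|-------|"]
        lines += [f"| {sev} | {loc} | {issue} |" for sev, loc, issue in rows]
        lines.append("")
    return "\n".join(lines).rstrip()


report = render_findings([
    ("Vectorization", "CRITICAL", "process.py:45", "iterrows() loop - use vectorized ops"),
    ("Memory", "HIGH", "load.py:23", "object dtype for 'status' - use category"),
])
print(report)
```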

Performance Tips

# Profile memory usage
df.info(memory_usage="deep")

# Profile time
%timeit df.apply(func)  # Jupyter magic
%timeit df["col"].map(func)

# Use query() for complex filters (returns a new, filtered DataFrame)
filtered = df.query("price > 100 and category == 'A'")

# Use eval() for complex expressions (returns a new DataFrame unless inplace=True)
df = df.eval("profit = revenue - cost")
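
`%timeit` only works inside IPython or Jupyter. In a plain script, the standard-library `timeit` module gives a comparable measurement. The sketch below compares an element-wise `apply` with the equivalent vectorized expression on synthetic data; absolute timings depend on the machine, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data for the comparison (not from the original document)
df = pd.DataFrame({"price": np.random.default_rng(0).random(50_000)})


def with_apply() -> pd.Series:
    return df["price"].apply(lambda x: x * 1.1)


def vectorized() -> pd.Series:
    return df["price"] * 1.1


t_apply = timeit.timeit(with_apply, number=5)
t_vec = timeit.timeit(vectorized, number=5)
print(f"apply: {t_apply:.4f}s  vectorized: {t_vec:.4f}s")
```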

Best Practices

  1. Vectorize: Avoid loops, use broadcasting
  2. Dtypes: Use smallest sufficient type
  3. Validate: Check data quality early
  4. Profile: Measure before optimizing
  5. Chunk: Process large files incrementally
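
As an illustration of the "Chunk" practice, an aggregation can be computed per chunk and the partial results combined afterwards, so the full file never has to fit in memory. In this sketch an in-memory CSV stands in for a large file, and the column names are invented.

```python
import io

import pandas as pd

# Stand-in for a large CSV file on disk
csv_data = io.StringIO("category,amount\nA,10\nB,5\nA,7\nB,1\nA,3\n")

partials = []
for chunk in pd.read_csv(csv_data, chunksize=2):
    # Reduce each chunk to a small per-category summary
    partials.append(chunk.groupby("category")["amount"].sum())

# Combine the per-chunk sums into final totals
totals = pd.concat(partials).groupby(level=0).sum()
print(totals)
```

This works for aggregations that can be merged from partial results (sums, counts, min, max); a median, for example, cannot be combined this way.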

Integration

  • python-reviewer: General Python patterns
  • perf-analyzer: Performance profiling
  • coverage-analyzer: Test coverage for data code

Install

Requires askill CLI v1.0+

AI Quality Score

92/100 (analyzed 2/19/2026)

High-quality technical reference skill for Pandas/NumPy code review with comprehensive detection rules, code examples, and response templates. Well-organized with clear sections covering vectorization, memory optimization, data validation, and best practices. Slight mismatch between tags (ci-cd, github-actions) and actual content (data code review). Located in dedicated skills folder indicating good structure.


Metadata

License: unknown
Version: -
Updated: 12/29/2025
Publisher: physics91

Tags

ci-cd, github-actions, testing