spaCy NLP

Production-ready NLP with spaCy 3.x. This skill covers installation through deployment.

Quick Start
Installation
Text Processing
Training Classifiers
Troubleshooting
Production Deployment

Scope

In Scope:

spaCy 3.x installation and text processing
TextCategorizer training for document classification
Production deployment and optimization patterns

Out of Scope (use other tools/skills):

Training custom NER models (different workflow)
spaCy 2.x (deprecated, incompatible with 3.x)
Rule-based matching (EntityRuler, Matcher, PhraseMatcher)
Custom tokenizers or language models

Quick Start

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

# Entities
for ent in doc.ents:
    print(ent.text, ent.label_)

# Tokens with attributes
for token in doc:
    print(token.text, token.pos_, token.dep_)

Installation

Standard Setup

pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_sm

Model Selection

Model	Size	Speed	Use Case
`en_core_web_sm`	12 MB	Fastest	Prototyping, speed-critical
`en_core_web_md`	40 MB	Fast	General use with word vectors
`en_core_web_lg`	560 MB	Fast	Semantic similarity tasks
`en_core_web_trf`	438 MB	Slow	Maximum accuracy (GPU)

Verify Installation

import spacy
print(spacy.__version__)
nlp = spacy.load("en_core_web_sm")
doc = nlp("Test sentence.")
print(f"Tokens: {len(doc)}")

For detailed installation options (conda, GPU, transformers): See references/installation.md

Text Processing

Basic Pipeline

nlp = spacy.load("en_core_web_sm")
doc = nlp("The striped bats are hanging on their feet.")

# Tokenization + attributes
for token in doc:
    print(f"{token.text:10} | {token.lemma_:10} | {token.pos_:6} | {token.dep_}")

Named Entity Recognition

doc = nlp("Apple Inc. was founded by Steve Jobs.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "Apple Inc." ORG, "Steve Jobs" PERSON

For entity types, filtering, and span details: See references/basic-usage.md

Batch Processing (Critical for Production)

# WRONG - slow
for text in texts:
    doc = nlp(text)  # Don't do this

# CORRECT - fast
for doc in nlp.pipe(texts, batch_size=50):
    process(doc)

# With multiprocessing (run under `if __name__ == "__main__":` on platforms that spawn worker processes)
docs = list(nlp.pipe(texts, n_process=4))
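
When each text carries metadata (an ID, a source path), `nlp.pipe` can pass it through with `as_tuples=True`, so results stay matched to their inputs without a second lookup. The sketch below assumes `records` is an iterable of `(text, context)` pairs; `iter_entity_rows` is a hypothetical helper name, not part of spaCy.

```python
def iter_entity_rows(nlp, records, batch_size=50):
    """Yield (context, entities) for each (text, context) pair in records.

    nlp is any loaded spaCy pipeline with an NER component; records can be
    a generator, so large inputs are never held in memory all at once.
    """
    # as_tuples=True makes nlp.pipe yield (doc, context) instead of doc
    for doc, context in nlp.pipe(records, as_tuples=True, batch_size=batch_size):
        yield context, [(ent.text, ent.label_) for ent in doc.ents]
```

A call might look like `for ctx, ents in iter_entity_rows(nlp, [("Apple is buying a startup.", {"id": 1})]): ...`. The batch size of 50 mirrors the example above and is worth tuning per workload.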

Disable Unused Components

# Only need NER: disable the rest (often around 2x faster; measure on your own data)
nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger", "lemmatizer"])

For Doc/Token/Span details, noun chunks, similarity: See references/basic-usage.md

Training Classifiers

Train custom text classifiers with TextCategorizer.

Workflow Overview

Prepare data → Run scripts/prepare_training_data.py
Generate config → Run scripts/generate_config.py or use assets/config_textcat.cfg
Validate → python -m spacy debug data config.cfg (catches issues before training)
Train → python -m spacy train config.cfg --output ./output
Evaluate → Run scripts/evaluate_model.py
Use → nlp = spacy.load("./output/model-best")

Data Format

Training data uses spaCy's DocBin format. Example input (JSON):

[
  {"text": "Quarterly revenue exceeded expectations", "label": "Business"},
  {"text": "Fixed null pointer exception in parser", "label": "Programming"},
  {"text": "Kubernetes deployment manifest updated", "label": "DevOps"}
]

Convert with script:

python scripts/prepare_training_data.py \
  --input data.json \
  --output-train train.spacy \
  --output-dev dev.spacy \
  --split 0.8
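
The contents of `scripts/prepare_training_data.py` are not shown here, so the following is only a sketch of what such a conversion might involve: a per-label split and a one-hot `doc.cats` dict for each example, written to a DocBin. The function names are hypothetical, and the one-hot encoding assumes mutually exclusive categories.

```python
import random
from collections import defaultdict


def one_hot_cats(label, labels):
    """Build a doc.cats dict: 1.0 for the gold label, 0.0 for every other label."""
    return {name: 1.0 if name == label else 0.0 for name in labels}


def stratified_split(records, train_fraction=0.8, seed=0):
    """Split records label by label so train and dev keep a similar label mix."""
    by_label = defaultdict(list)
    for record in records:
        by_label[record["label"]].append(record)
    rng = random.Random(seed)
    train, dev = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        # Keep at least one training example per label
        cut = max(1, round(len(items) * train_fraction))
        train.extend(items[:cut])
        dev.extend(items[cut:])
    return train, dev


def write_docbin(records, labels, path):
    """Serialize records to a .spacy file readable by `spacy train`."""
    # Imported here so the helpers above work without spaCy installed
    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.blank("en")  # tokenizer only; no trained components needed
    doc_bin = DocBin()
    for record in records:
        doc = nlp.make_doc(record["text"])
        doc.cats = one_hot_cats(record["label"], labels)
        doc_bin.add(doc)
    doc_bin.to_disk(path)
```

With the JSON above loaded into `records`, `train, dev = stratified_split(records, 0.8)` followed by `write_docbin(train, labels, "train.spacy")` and `write_docbin(dev, labels, "dev.spacy")` would produce the two files the training step expects.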

Training Command

# Generate optimized config
python scripts/generate_config.py --categories "Business,Technology,Programming,DevOps"

# Or use template
cp assets/config_textcat.cfg config.cfg

# Train
python -m spacy train config.cfg --output ./output

# With GPU
python -m spacy train config.cfg --output ./output --gpu-id 0

Using Trained Model

nlp = spacy.load("./output/model-best")
doc = nlp("Deploy the application to Kubernetes cluster")
predicted = max(doc.cats, key=doc.cats.get)
confidence = doc.cats[predicted]
print(f"{predicted}: {confidence:.1%}")  # e.g. DevOps: 94.2%
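
Taking the maximum of `doc.cats` always returns some label, even when the model is unsure. One possible refinement is to fall back to a placeholder below a confidence threshold. The sketch below works on any `doc.cats`-style dict; the name `predict_label`, the 0.5 threshold, and the `"uncertain"` fallback are illustrative assumptions to tune on a dev set.

```python
def predict_label(cats, threshold=0.5, fallback="uncertain"):
    """Return (label, score), or (fallback, score) when the best score is below threshold."""
    if not cats:
        return fallback, 0.0
    label = max(cats, key=cats.get)
    score = cats[label]
    if score < threshold:
        return fallback, score
    return label, score
```

Usage: `label, score = predict_label(doc.cats, threshold=0.6)`.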

For detailed training guide: See references/text-classification.md

Troubleshooting

Model Not Found (E050)

OSError: [E050] Can't find model 'en_core_web_sm'

Fix:

python -m spacy download en_core_web_sm

Alternative (avoids path issues):

import en_core_web_sm
nlp = en_core_web_sm.load()

Memory Issues

Symptoms: OOM errors, slow processing

Fixes:

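# 1. Exclude unused components (`exclude` skips loading them; `disable` loads them but does not run them)
nlp = spacy.load("en_core_web_sm", exclude=["parser", "ner"])

# 2. Process in chunks (chunk_text is a helper you supply; nlp.max_length defaults to 1,000,000 characters)
for chunk in chunk_text(large_text, max_length=100000):
    doc = nlp(chunk)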

# 3. Use memory zones (spaCy 3.8+)
with nlp.memory_zone():
    for doc in nlp.pipe(batch):
        process(doc)
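
`chunk_text` in fix 2 is not defined in this document. A minimal sketch of such a helper, assuming it is acceptable to cut at paragraph breaks or other whitespace (it falls back to a hard cut only when a window contains no whitespace at all):

```python
def chunk_text(text, max_length=100_000):
    """Yield pieces of text no longer than max_length characters.

    Prefers to cut at a blank line, then at any space or newline, so that
    words are not split unless a single run exceeds max_length.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    n = len(text)
    start = 0
    while start < n:
        # Skip whitespace left over from the previous cut
        while start < n and text[start].isspace():
            start += 1
        if start >= n:
            break
        end = min(start + max_length, n)
        if end < n:
            cut = text.rfind("\n\n", start, end)
            if cut <= start:
                cut = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if cut > start:
                end = cut
        yield text[start:end].rstrip()
        start = end
```

Cutting at whitespace can still split a sentence across two chunks, so annotations that depend on sentence context may differ slightly near chunk boundaries.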

GPU Not Working

import spacy

# Must call BEFORE loading model
if spacy.prefer_gpu():
    print("Using GPU")
else:
    print("GPU not available")

nlp = spacy.load("en_core_web_trf")  # Now loads on GPU

Version Compatibility

spaCy 2.x models do not work with spaCy 3.x. Check compatibility:

python -m spacy validate

For more troubleshooting: See references/troubleshooting.md

Production Deployment

Package Model

python -m spacy package ./output/model-best ./packages \
  --name my_classifier \
  --version 1.0.0

pip install ./packages/en_my_classifier-1.0.0/

FastAPI Server

Use the production template:

python scripts/serve_model.py --model ./output/model-best --port 8000

Or customize from template:

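from fastapi import FastAPI
from pydantic import BaseModel
import spacy

app = FastAPI()
nlp = spacy.load("en_my_classifier")  # load once at startup, not per request

class ClassifyRequest(BaseModel):
    text: str  # sent as a JSON body: {"text": "..."}

@app.post("/classify")
async def classify(request: ClassifyRequest):
    # memory_zone() requires spaCy 3.8+; the Doc must not be used after the zone exits
    with nlp.memory_zone():
        doc = nlp(request.text)
        scores = dict(doc.cats)  # copy the plain floats out of the Doc
    return {
        "category": max(scores, key=scores.get),
        "scores": scores,
    }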

Performance Optimization

Technique	Speedup	When to Use
Disable components	2-3x	Don't need all annotations
`nlp.pipe()`	5-10x	Processing multiple texts
Multiprocessing	2-4x	CPU-bound, many cores
GPU	2-5x	Transformer models
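
The speedups in this table depend heavily on hardware, text length, and pipeline, so it is worth measuring on your own data. A minimal timing sketch follows; `texts_per_second` is a hypothetical helper, and `process` is any callable that consumes a list of texts.

```python
import time


def texts_per_second(process, texts, repeats=3):
    """Return the best observed throughput of process(texts) over several runs."""
    texts = list(texts)
    best = float("inf")
    for _ in range(repeats):
        started = time.perf_counter()
        process(texts)
        best = min(best, time.perf_counter() - started)
    # Guard against a zero reading on very coarse timers
    return len(texts) / best if best > 0 else float("inf")


# Example comparison, assuming a loaded pipeline `nlp` and a list `texts`:
# one_by_one = texts_per_second(lambda ts: [nlp(t) for t in ts], texts)
# piped = texts_per_second(lambda ts: list(nlp.pipe(ts, batch_size=50)), texts)
```

Taking the best of several runs reduces noise from warm-up and other processes; a few hundred representative texts usually give a more stable reading than a handful.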

For evaluation metrics and hyperparameter tuning: See references/production.md

Scripts Reference

Script	Purpose	Usage
`prepare_training_data.py`	Convert JSON to DocBin	`python scripts/prepare_training_data.py --input data.json`
`generate_config.py`	Create training config	`python scripts/generate_config.py --categories "A,B,C"`
`evaluate_model.py`	Detailed metrics	`python scripts/evaluate_model.py --model ./output/model-best`
`serve_model.py`	FastAPI server	`python scripts/serve_model.py --model ./output/model-best --port 8000`

Assets Reference

Asset	Purpose	Usage
`config_textcat.cfg`	Base training config	Copy and customize for your labels
`training_data_template.json`	Data format example	Reference for preparing your data
