using-spacy-nlp

Industrial-strength NLP with spaCy 3.x for text processing and custom classifier training. Use when "installing spaCy", "selecting model for nlp" (en_core_web_sm/md/lg/trf), "tokenization", "POS tagging", "named entity recognition" (NER), "dependency parsing", "training TextCategorizer models", "troubleshooting spaCy errors" (E050/E941 model errors, E927 version mismatch, memory issues), "batch processing with nlp.pipe", or "deploying nlp models to production". Includes data preparation scripts, config templates, and FastAPI serving examples.

Updated 1/7/2026

spaCy NLP

Production-ready NLP with spaCy 3.x. This skill covers installation through deployment.

Scope

In Scope:

  • spaCy 3.x installation and text processing
  • TextCategorizer training for document classification
  • Production deployment and optimization patterns

Out of Scope (use other tools/skills):

  • Training custom NER models (different workflow)
  • spaCy 2.x (deprecated, incompatible with 3.x)
  • Rule-based matching (EntityRuler, Matcher, PhraseMatcher)
  • Custom tokenizers or language models

Quick Start

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

# Entities
for ent in doc.ents:
    print(ent.text, ent.label_)

# Tokens with attributes
for token in doc:
    print(token.text, token.pos_, token.dep_)

Installation

Standard Setup

pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_sm

Model Selection

| Model | Size | Speed | Use Case |
|---|---|---|---|
| en_core_web_sm | 12 MB | Fastest | Prototyping, speed-critical |
| en_core_web_md | 40 MB | Fast | General use with word vectors |
| en_core_web_lg | 560 MB | Fast | Semantic similarity tasks |
| en_core_web_trf | 438 MB | Slow | Maximum accuracy (GPU) |

Verify Installation

import spacy
print(spacy.__version__)
nlp = spacy.load("en_core_web_sm")
doc = nlp("Test sentence.")
print(f"Tokens: {len(doc)}")

For detailed installation options (conda, GPU, transformers): See references/installation.md


Text Processing

Basic Pipeline

nlp = spacy.load("en_core_web_sm")
doc = nlp("The striped bats are hanging on their feet.")

# Tokenization + attributes
for token in doc:
    print(f"{token.text:10} | {token.lemma_:10} | {token.pos_:6} | {token.dep_}")

Named Entity Recognition

doc = nlp("Apple Inc. was co-founded by Steve Jobs.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "Apple Inc." ORG, "Steve Jobs" PERSON

For entity types, filtering, and span details: See references/basic-usage.md

Batch Processing (Critical for Production)

# Slow: one nlp() call per text
for text in texts:
    doc = nlp(text)

# Fast: nlp.pipe() processes the texts in batches
for doc in nlp.pipe(texts, batch_size=50):
    process(doc)  # process() stands in for your own handling code

# With multiprocessing
docs = list(nlp.pipe(texts, n_process=4))
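
When each text carries metadata (an ID, a source), `nlp.pipe` can pass it through alongside the text with `as_tuples=True`. A minimal sketch, assuming records are dicts with hypothetical `id` and `text` keys; `to_tuples` and `annotate` are illustrative names, not part of spaCy:

```python
from typing import Any, Dict, Iterable, Iterator, List, Tuple


def to_tuples(records: Iterable[Dict[str, Any]]) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (text, context) pairs lazily so large inputs are never fully in memory."""
    for record in records:
        yield record["text"], {"id": record["id"]}


def annotate(nlp, records: Iterable[Dict[str, Any]], batch_size: int = 50) -> Iterator[Tuple[Any, List[Tuple[str, str]]]]:
    """Yield (record id, entities) for each record, processed in batches.

    `nlp` is expected to be a loaded pipeline, e.g. spacy.load("en_core_web_sm").
    """
    pairs = to_tuples(records)
    # as_tuples=True makes nlp.pipe yield (doc, context) instead of just doc
    for doc, context in nlp.pipe(pairs, as_tuples=True, batch_size=batch_size):
        yield context["id"], [(ent.text, ent.label_) for ent in doc.ents]
```

Because both functions are generators, results can be written out as they arrive instead of being collected in a list first.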

Disable Unused Components

# Only need NER: disable the rest (often around 2x faster; measure on your own data)
nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger", "lemmatizer"])

For Doc/Token/Span details, noun chunks, similarity: See references/basic-usage.md


Training Classifiers

Train custom text classifiers with TextCategorizer.

Workflow Overview

  1. Prepare data → Run scripts/prepare_training_data.py
  2. Generate config → Run scripts/generate_config.py or use assets/config_textcat.cfg
  3. Validate → python -m spacy debug data config.cfg (catches issues before training)
  4. Train → python -m spacy train config.cfg --output ./output
  5. Evaluate → Run scripts/evaluate_model.py
  6. Use → nlp = spacy.load("./output/model-best")

Data Format

Training data uses spaCy's DocBin format. Example input (JSON):

[
  {"text": "Quarterly revenue exceeded expectations", "label": "Business"},
  {"text": "Fixed null pointer exception in parser", "label": "Programming"},
  {"text": "Kubernetes deployment manifest updated", "label": "DevOps"}
]
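
Malformed records and skewed label counts are easier to catch before conversion. A minimal sketch of such a check, assuming the JSON layout shown above; `validate_records` is a hypothetical helper, not one of the bundled scripts:

```python
from collections import Counter
from typing import Any, Dict, List, Tuple


def validate_records(records: List[Dict[str, Any]]) -> Tuple[Counter, List[str]]:
    """Return (label counts, list of problems) for the JSON records."""
    counts: Counter = Counter()
    problems: List[str] = []
    for i, record in enumerate(records):
        text = record.get("text")
        label = record.get("label")
        if not isinstance(text, str) or not text.strip():
            problems.append(f"record {i}: missing or empty 'text'")
            continue
        if not isinstance(label, str) or not label.strip():
            problems.append(f"record {i}: missing or empty 'label'")
            continue
        counts[label] += 1
    return counts, problems


records = [
    {"text": "Quarterly revenue exceeded expectations", "label": "Business"},
    {"text": "Fixed null pointer exception in parser", "label": "Programming"},
    {"text": "", "label": "DevOps"},  # deliberately invalid: empty text
]
counts, problems = validate_records(records)
print(dict(counts))
print(problems)
```

The label counts give a quick view of class balance; a category with very few examples is a likely source of poor scores later.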

Convert with script:

python scripts/prepare_training_data.py \
  --input data.json \
  --output-train train.spacy \
  --output-dev dev.spacy \
  --split 0.8
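
The script's internals are not shown here. As a rough sketch of what a conversion of this kind could look like, assuming single-label (mutually exclusive) categories: `split_records`, `make_cats`, and `write_docbin` are hypothetical names, while `spacy.blank`, `nlp.make_doc`, and `DocBin` are spaCy's own API.

```python
import random
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, str]


def split_records(records: Sequence[Record], ratio: float = 0.8, seed: int = 0) -> Tuple[List[Record], List[Record]]:
    """Shuffle deterministically and split into (train, dev)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]


def make_cats(label: str, categories: Sequence[str]) -> Dict[str, float]:
    """Build the doc.cats dict: 1.0 for the gold label, 0.0 for every other category."""
    if label not in categories:
        raise ValueError(f"unknown label: {label!r}")
    return {category: float(category == label) for category in categories}


def write_docbin(records: Sequence[Record], categories: Sequence[str], path: str) -> None:
    """Serialize records to a .spacy file that `spacy train` can read."""
    import spacy  # imported here so the helpers above work without spaCy installed
    from spacy.tokens import DocBin

    nlp = spacy.blank("en")  # tokenizer only; no trained components needed
    doc_bin = DocBin()
    for record in records:
        doc = nlp.make_doc(record["text"])
        doc.cats = make_cats(record["label"], categories)
        doc_bin.add(doc)
    doc_bin.to_disk(path)
```

Typical use would be `train, dev = split_records(records, 0.8)` followed by one `write_docbin` call per split, with the same `categories` list passed to both so the label set stays identical.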

Training Command

# Generate optimized config
python scripts/generate_config.py --categories "Business,Technology,Programming,DevOps"

# Or use template
cp assets/config_textcat.cfg config.cfg

# Train
python -m spacy train config.cfg --output ./output

# With GPU
python -m spacy train config.cfg --output ./output --gpu-id 0

Using Trained Model

nlp = spacy.load("./output/model-best")
doc = nlp("Deploy the application to Kubernetes cluster")
predicted = max(doc.cats, key=doc.cats.get)
confidence = doc.cats[predicted]
print(f"{predicted}: {confidence:.1%}")  # e.g. DevOps: 94.2%
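
Taking `max` over `doc.cats` always returns a label, even when every score is low, and raises `ValueError` if `doc.cats` is empty (for example, when the pipeline has no text classifier). A minimal sketch of a guarded version; `top_category` and its default threshold of 0.5 are illustrative assumptions, not spaCy API:

```python
from typing import Dict, Optional, Tuple


def top_category(cats: Dict[str, float], threshold: float = 0.5) -> Tuple[Optional[str], float]:
    """Return (label, score) for the best category, or (None, score) if below threshold."""
    if not cats:
        return None, 0.0
    label = max(cats, key=cats.get)
    score = cats[label]
    return (label if score >= threshold else None), score
```

With a trained pipeline this would be called as `label, score = top_category(doc.cats, threshold=0.6)`; a `None` label can then be routed to a fallback such as manual review.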

For detailed training guide: See references/text-classification.md


Troubleshooting

Model Not Found (E050)

OSError: [E050] Can't find model 'en_core_web_sm'

Fix:

python -m spacy download en_core_web_sm

Alternative (avoids path issues):

import en_core_web_sm
nlp = en_core_web_sm.load()

Memory Issues

Symptoms: OOM errors, slow processing

Fixes:

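# 1. Exclude unused components (excluded components are not loaded at all)
nlp = spacy.load("en_core_web_sm", exclude=["parser", "ner"])

# 2. Process in chunks (chunk_text is a user-defined helper, not part of spaCy)
for chunk in chunk_text(large_text, max_length=100000):
    doc = nlp(chunk)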

# 3. Use memory zones (spaCy 3.8+)
with nlp.memory_zone():
    for doc in nlp.pipe(batch):
        process(doc)
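
`chunk_text` in fix 2 is not a spaCy function. One possible implementation, as a sketch: it cuts at the last space or newline inside each window so words stay intact, and falls back to a hard cut only when a window contains no whitespace.

```python
from typing import Iterator


def chunk_text(text: str, max_length: int = 100_000) -> Iterator[str]:
    """Yield pieces of `text` no longer than `max_length` characters.

    Each cut is made just after the last space or newline inside the window
    where possible; a hard cut is used only if no such whitespace is found.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    start = 0
    while start < len(text):
        end = min(start + max_length, len(text))
        if end < len(text):
            cut = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if cut > start:
                end = cut + 1  # keep the whitespace with the left-hand chunk
        yield text[start:end]
        start = end


chunks = list(chunk_text("alpha beta gamma delta", max_length=11))
print(chunks)  # ['alpha beta ', 'gamma delta']
```

Joining the chunks reproduces the original text exactly. Cuts can still fall mid-sentence, so annotations that depend on sentence context (parsing, some entities) may be less reliable near chunk boundaries.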

GPU Not Working

import spacy

# Must call BEFORE loading model
if spacy.prefer_gpu():
    print("Using GPU")
else:
    print("GPU not available")

nlp = spacy.load("en_core_web_trf")  # Now loads on GPU

Version Compatibility

spaCy 2.x models do not work with spaCy 3.x. Check compatibility:

python -m spacy validate

For more troubleshooting: See references/troubleshooting.md


Production Deployment

Package Model

# The output directory must exist before packaging
mkdir -p ./packages
python -m spacy package ./output/model-best ./packages \
  --name my_classifier \
  --version 1.0.0

pip install ./packages/en_my_classifier-1.0.0/

FastAPI Server

Use the production template:

python scripts/serve_model.py --model ./output/model-best --port 8000

Or customize from template:

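from fastapi import FastAPI
import spacy

app = FastAPI()
nlp = spacy.load("en_my_classifier")  # load once at startup, not per request

@app.post("/classify")
async def classify(text: str):  # `text` is read from the query string
    # Memory zone (spaCy 3.8+): the Doc must not be used after the block exits,
    # so copy plain Python values out before returning
    with nlp.memory_zone():
        doc = nlp(text)
        scores = dict(doc.cats)
    return {
        "category": max(scores, key=scores.get),
        "scores": scores
    }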

Performance Optimization

| Technique | Speedup | When to Use |
|---|---|---|
| Disable components | 2-3x | Don't need all annotations |
| nlp.pipe() | 5-10x | Processing multiple texts |
| Multiprocessing | 2-4x | CPU-bound, many cores |
| GPU | 2-5x | Transformer models |
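
The speedups above are rough figures and depend on the model, text length, and hardware, so it is worth measuring on your own data. A minimal timing sketch using only the standard library; `texts_per_second` is a hypothetical helper:

```python
import time
from typing import Callable, Sequence


def texts_per_second(run: Callable[[Sequence[str]], object], texts: Sequence[str], repeats: int = 3) -> float:
    """Return the best throughput (texts/second) over `repeats` runs of `run(texts)`."""
    best = float("inf")
    for _ in range(repeats):
        started = time.perf_counter()
        run(texts)
        best = min(best, time.perf_counter() - started)
    return len(texts) / best if best > 0 else float("inf")


# Example comparison (requires a loaded pipeline `nlp` and a list `texts`):
# one_by_one = texts_per_second(lambda ts: [nlp(t) for t in ts], texts)
# batched = texts_per_second(lambda ts: list(nlp.pipe(ts, batch_size=50)), texts)
# print(f"speedup: {batched / one_by_one:.1f}x")
```

Taking the best of several runs reduces noise from warm-up and other processes; use a sample of a few hundred representative texts rather than a handful.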

For evaluation metrics and hyperparameter tuning: See references/production.md


Scripts Reference

| Script | Purpose | Usage |
|---|---|---|
| prepare_training_data.py | Convert JSON to DocBin | python scripts/prepare_training_data.py --input data.json |
| generate_config.py | Create training config | python scripts/generate_config.py --categories "A,B,C" |
| evaluate_model.py | Detailed metrics | python scripts/evaluate_model.py --model ./output/model-best |
| serve_model.py | FastAPI server | python scripts/serve_model.py --model ./model --port 8000 |

Assets Reference

| Asset | Purpose | Usage |
|---|---|---|
| config_textcat.cfg | Base training config | Copy and customize for your labels |
| training_data_template.json | Data format example | Reference for preparing your data |

Metadata

License: unknown
Version: -
Updated: 1/7/2026
Publisher: SpillwaveSolutions

Tags

api, ci-cd, github-actions, observability, testing