Magic Tags

LLM-generated taxonomy tags for Elasticsearch documents — classify with embeddings, dedup with vector similarity, share a tag registry across indices.


Magic Tags adds a keyword field whose values come from an LLM: short, reusable labels (typically kebab-case) that describe another field’s content. The pipeline favors reusing existing tags so your vocabulary stays stable for search, filtering, and downstream agent tooling.

Note: Magic Tags is not part of this repository. It’s a separate Sigmie package that registers a magicTags() macro on NewProperties and a CollectionHook for indexing. This page documents the intended behavior. Examples assume the package is installed and registered.

Behind the scenes, the package maintains a sidecar index of unique tags with semantic embeddings on the tag text. The sidecar uses the same embeddings API and dimensions as your source field, so vector operations on tags stay consistent with the rest of your data.

Install the package

use Sigmie\Sigmie;
use Vendor\MagicTags\MagicTagsPackage;
 
$sigmie = new Sigmie($connection);
$sigmie->extend(new MagicTagsPackage());

extend() registers the macro and hook on this Sigmie instance. See Extending Sigmie for the package interface.

How it fits together

Main index documents                     Sidecar index (tag registry)
+---------------------------+            +-----------------------------+
| content (semantic text)   |            | magic_field_path (keyword)  |
| topic (magic_tags)        |    sync    | tag (short text + vectors)  |
| _embeddings.content ...   | ─────────► | _embeddings.tag ...         |
+---------------------------+            +-----------------------------+
              │                                         │
              │                                         │
       LLM + optional                         Same embedding API as
       classify-first                           the source field

The main index stores tags as an array of strings on each document (mapped as keyword with meta.type magic_tags).

The sidecar index name defaults to {logicalName}__sigmie_magic_tags. Each row is one (magic_field_path, tag) pair with a deterministic _id (md5(path::tag)) so repeated writes upsert rather than duplicate.
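To make the naming and upsert behavior concrete, here is a minimal sketch of how the sidecar name and row id could be derived. The helper names sidecarIndexName() and sidecarRowId() are hypothetical and not part of the package, and the literal "::" separator is an assumption read off the md5(path::tag) notation above.

```php
<?php

// Hypothetical helpers for illustration; the package may build these values differently.

// Default sidecar name: {logicalName}__sigmie_magic_tags.
function sidecarIndexName(string $logicalName): string
{
    return $logicalName . '__sigmie_magic_tags';
}

// Deterministic row id: md5 over "path::tag" (separator assumed from the notation above).
function sidecarRowId(string $magicFieldPath, string $tag): string
{
    return md5($magicFieldPath . '::' . $tag);
}

$first = sidecarRowId('topic', 'circuit-breaker');
$second = sidecarRowId('topic', 'circuit-breaker');

// The same (path, tag) pair always hashes to the same id, so a repeated write overwrites the row.
echo $first === $second ? "same id\n" : "different id\n";
echo sidecarIndexName('kb') . "\n"; // kb__sigmie_magic_tags
```

Because the id depends only on the pair, the same tag text under two different field paths still produces two separate rows.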

Define magic tags on a mapping

The source field must be a semantic text field — the package reads its embeddings configuration to set up the sidecar:

use Sigmie\Mappings\NewProperties;
 
$props = new NewProperties;
 
$props->text('content')
    ->semantic(api: 'my-embeddings', accuracy: 1, dimensions: 1024);
 
$props->magicTags('topic', fromField: 'content')
    ->api('my-llm');

Register the same API names on the collection:

$collection = $sigmie->collect('kb', refresh: true)
    ->properties($props)
    ->apis([
        'my-llm' => $llmApi,
        'my-embeddings' => $embeddingsApi,
    ]);

Now merge() and add() run the magic-tag pipeline:

use Sigmie\Document\Document;
 
$collection->merge([
    new Document(['content' => 'How to reset a circuit breaker.']),
]);

The document gets a topic array. The LLM populates it unless enough tags already exist for classification to assign tags first, in which case no LLM call is made.

Generation order

When you index a document:

  1. Classify-first (optional). If classifyFirst(true) and the sidecar has enough tags, the package embeds the source text and scores it against centroids built from sample passages per tag. Tags above the confidence threshold are applied without an LLM call.
  2. LLM fallback. If classification is disabled, is skipped because the sidecar has too few tags, or returns nothing, the LLM generates tags from the source text plus a prompt listing existing tags for reuse.
  3. Dedup. New tags are deduplicated against existing ones using embedding similarity.
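The classify-first step can be sketched in plain PHP as follows. This is an illustration under stated assumptions, not the package's code: cosine similarity as the score, a simple mean as the centroid, and the function names cosine(), centroid(), and classify() are all hypothetical. The toy vectors stand in for real embeddings.

```php
<?php

// Cosine similarity between two equal-length vectors (assumed scoring function).
function cosine(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0;
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

// Centroid as the component-wise mean of a tag's sample passage vectors.
function centroid(array $vectors): array
{
    $sum = array_fill(0, count($vectors[0]), 0.0);
    foreach ($vectors as $vector) {
        foreach ($vector as $i => $value) {
            $sum[$i] += $value;
        }
    }
    $n = count($vectors);
    return array_map(fn (float $x): float => $x / $n, $sum);
}

// Returns tags scoring above $confidence, best first, or [] so the caller falls back to the LLM.
function classify(array $textVector, array $samplesByTag, float $confidence, int $minTags, int $maxTags): array
{
    if (count($samplesByTag) < $minTags) {
        return []; // not enough tags in the registry yet
    }
    $scores = [];
    foreach ($samplesByTag as $tag => $samples) {
        $score = cosine($textVector, centroid($samples));
        if ($score > $confidence) {
            $scores[$tag] = $score;
        }
    }
    arsort($scores); // highest similarity first, keys preserved
    return array_slice(array_keys($scores), 0, $maxTags);
}

$samplesByTag = [
    'electrical' => [[1.0, 0.0], [0.8, 0.2]],
    'plumbing'   => [[0.0, 1.0], [0.1, 0.9]],
];

print_r(classify([0.9, 0.1], $samplesByTag, 0.3, 2, 5));  // ['electrical']
print_r(classify([0.9, 0.1], $samplesByTag, 0.3, 10, 5)); // [] (fewer than 10 tags)
```

An empty result is the signal to move on to step 2; the toy run uses a minimum of 2 tags only so the two-tag registry qualifies.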

Configure classification and dedup

$props->magicTags('topic', fromField: 'content')
    ->api('my-llm')
    ->embeddingsApi('my-embeddings')
    ->embeddingDimensions(1024)
    ->classifyFirst(true)
    ->minTagsForClassification(10) // need 10+ tags before classifying
    ->classifyConfidence(0.3)      // minimum centroid similarity
    ->classifySamplesPerTag(5)     // passages per tag for centroid
    ->similarityThreshold(0.85)    // dedup threshold
    ->maxTags(5);
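To show what a dedup threshold such as 0.85 means in practice, here is a small sketch. It assumes cosine similarity and assumes that a near-duplicate candidate is replaced by the closest existing tag; the page does not spell out the replacement rule, and cosineSimilarity() and dedupTags() are hypothetical names.

```php
<?php

// Cosine similarity between two equal-length vectors (assumed metric).
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0;
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

// $candidates and $existing map tag text to its embedding vector.
// A candidate at or above the threshold is swapped for the closest existing tag (assumed rule).
function dedupTags(array $candidates, array $existing, float $threshold): array
{
    $result = [];
    foreach ($candidates as $tag => $vector) {
        $bestTag = $tag;
        $bestScore = $threshold;
        foreach ($existing as $existingTag => $existingVector) {
            $score = cosineSimilarity($vector, $existingVector);
            if ($score >= $bestScore) {
                $bestTag = $existingTag;
                $bestScore = $score;
            }
        }
        $result[$bestTag] = true; // array keys collapse repeated tags
    }
    return array_keys($result);
}

$existing = ['circuit-breaker' => [1.0, 0.0]];
$candidates = [
    'breaker-reset' => [0.9, 0.1], // close to circuit-breaker
    'water-heater'  => [0.0, 1.0], // unrelated, kept as a new tag
];

print_r(dedupTags($candidates, $existing, 0.85)); // ['circuit-breaker', 'water-heater']
```

Raising the threshold keeps more distinct tags; lowering it merges more aggressively and keeps the vocabulary smaller.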

Disable classification entirely:

$props->magicTags('topic', fromField: 'content')
    ->api('my-llm')
    ->classifyFirst(false);

Custom prompt

Override the default LLM instructions:

$props->magicTags('topic', fromField: 'content')
    ->api('my-llm')
    ->prompt(
        'You tag property-management support content. Return up to 5 lowercase '.
        'kebab-case tags. Prefer reusing tags from the existing list when applicable.'
    );

Share one registry across indices

Point several mappings at the same tagIndex() logical name to share a single sidecar. Main index names stay different:

$shared = 'property_app';
 
$kb = new NewProperties;
$kb->text('content')->semantic(api: 'my-embeddings', accuracy: 1, dimensions: 1024);
$kb->magicTags('topic', fromField: 'content')
    ->api('my-llm')
    ->tagIndex($shared);
 
$memory = new NewProperties;
$memory->text('content')->semantic(api: 'my-embeddings', accuracy: 1, dimensions: 1024);
$memory->magicTags('topic', fromField: 'content')
    ->api('my-llm')
    ->tagIndex($shared);

Both collections write to property_app__sigmie_magic_tags. Tag rows record magic_field_path so you can tell which source field produced each tag.

Note: The “existing tags” list shown to the LLM during generation is fetched from the current main index only. If you want a global vocabulary across collections for the prompt, merge tag lists yourself before calling merge().
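One way to build such a global vocabulary is sketched below. It only covers merging the lists; the helper mergeVocabularies() is hypothetical, the tag lists are made-up sample data, and feeding the result into a custom prompt() string is an assumption about how you might pass it along, not documented package behavior.

```php
<?php

// Hypothetical helper: union tag lists from several collections into one sorted vocabulary.
function mergeVocabularies(array ...$tagLists): array
{
    $merged = array_unique(array_merge(...$tagLists));
    sort($merged); // also reindexes the array from 0
    return $merged;
}

// Sample data; in practice each list might come from a terms aggregation per collection.
$kbTags = ['returns', 'shipping'];
$memoryTags = ['shipping', 'lease-renewal'];

$vocabulary = mergeVocabularies($kbTags, $memoryTags);

// Assumed usage: embed the shared vocabulary in the instructions given to prompt().
$prompt = 'Return up to 5 lowercase kebab-case tags. Prefer reusing these tags: '
    . implode(', ', $vocabulary);

echo $prompt . "\n";
// Return up to 5 lowercase kebab-case tags. Prefer reusing these tags: lease-renewal, returns, shipping
```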

Skip the pipeline for a batch

When documents already carry final tag values, index them without running hooks:

$collection->withoutHooks()->merge($documents);

Use tags in an agent tool

A chatbot or filter UI typically wants a list of available tags. Run a terms aggregation on the magic-tag field:

use Sigmie\Query\Aggs;
use Sigmie\Query\Queries\MatchAll;
use Sigmie\Query\Search as QuerySearch;
 
$aggs = new Aggs;
$aggs->terms('by_topic', 'topic')->size(20);
 
$response = (new QuerySearch($connection))
    ->index('kb')
    ->query(new MatchAll)
    ->aggs($aggs)
    ->size(0)
    ->get();
 
$buckets = $response->json('aggregations.by_topic.buckets');
// [['key' => 'returns', 'doc_count' => 42], ['key' => 'shipping', 'doc_count' => 18], ...]

This is separate from the internal tag list used during generation (which uses a larger size, often 500, so the LLM sees a broad vocabulary).
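Once you have the buckets, shaping them for a tool response is plain array work. The sketch below assumes the bucket structure shown in the comment above; tagsForTool() and its output keys (tag, documents) are hypothetical names chosen for the example.

```php
<?php

// Hypothetical helper: reduce terms-aggregation buckets to a compact list for a tool response.
function tagsForTool(array $buckets, int $minDocs = 1): array
{
    $tags = [];
    foreach ($buckets as $bucket) {
        // Drop rarely used tags so the agent sees only well-populated filters.
        if ($bucket['doc_count'] >= $minDocs) {
            $tags[] = ['tag' => $bucket['key'], 'documents' => $bucket['doc_count']];
        }
    }
    return $tags;
}

$buckets = [
    ['key' => 'returns', 'doc_count' => 42],
    ['key' => 'shipping', 'doc_count' => 18],
];

echo json_encode(tagsForTool($buckets, 20)) . "\n";
// [{"tag":"returns","documents":42}]
```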

See Aggregations.

What the package contains

A Magic Tags package typically registers:

  • NewProperties::macro('magicTags', ...) so mappings can call magicTags().
  • A CollectionHook via $sigmie->addCollectionHook(...) implementing:
    • shouldRun() — checks Properties::fieldsOfType(MagicTags::class) so unrelated collections skip the hook.
    • beforeBatch() — ensures the sidecar index exists.
    • processBatch() — LLM + classification + dedup.
    • afterBatch() — upserts tag rows into the sidecar.

See Extending Sigmie for the Package interface and the hook lifecycle.
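The order in which the four hook methods are likely to run can be sketched with a stand-in interface. Everything here is hypothetical: SketchBatchHook is not Sigmie's CollectionHook, whose real signatures are not shown on this page, and the call order is inferred from the method names and the list above.

```php
<?php

// Hypothetical stand-in interface; not the real CollectionHook contract.
interface SketchBatchHook
{
    public function shouldRun(array $fieldTypes): bool;
    public function beforeBatch(): void;
    public function processBatch(array $documents): array;
    public function afterBatch(array $documents): void;
}

// Records each call so the lifecycle order is visible.
final class RecordingHook implements SketchBatchHook
{
    public array $calls = [];

    public function shouldRun(array $fieldTypes): bool
    {
        $this->calls[] = 'shouldRun';
        return in_array('magic_tags', $fieldTypes, true);
    }

    public function beforeBatch(): void
    {
        $this->calls[] = 'beforeBatch'; // e.g. ensure the sidecar index exists
    }

    public function processBatch(array $documents): array
    {
        $this->calls[] = 'processBatch'; // e.g. classify, call the LLM, dedup
        return $documents;
    }

    public function afterBatch(array $documents): void
    {
        $this->calls[] = 'afterBatch'; // e.g. upsert tag rows into the sidecar
    }
}

// Assumed driver: skip everything when the collection has no magic-tag fields.
function runBatch(SketchBatchHook $hook, array $fieldTypes, array $documents): array
{
    if (! $hook->shouldRun($fieldTypes)) {
        return $documents;
    }
    $hook->beforeBatch();
    $documents = $hook->processBatch($documents);
    $hook->afterBatch($documents);
    return $documents;
}

$hook = new RecordingHook();
runBatch($hook, ['text', 'magic_tags'], [['content' => 'How to reset a circuit breaker.']]);
echo implode(' -> ', $hook->calls) . "\n";
// shouldRun -> beforeBatch -> processBatch -> afterBatch
```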

See also