Document Management

Documents are JSON objects stored within an Index. In Sigmie, you work with documents through the Document class and manage them using collection methods.

Introduction

Sigmie treats an Index as a Collection of Sigmie\Document\Document instances, so you add, count, and iterate over documents through familiar collection methods.

Creating Documents

Basic Document Creation

use Sigmie\Document\Document;
 
// Simple document
$document = new Document(['name' => 'Snow White']);
 
// Document with multiple fields
$document = new Document([
    'title' => 'The Lion King',
    'genre' => 'Animation',
    'year' => 1994,
    'rating' => 8.5,
    'tags' => ['family', 'musical', 'coming-of-age']
]);

Document with Custom ID

// Document with specific ID
$document = new Document([
    'title' => 'Frozen',
    'genre' => 'Animation'
], 'movie_123');  // Custom document ID

Complex Document Structures

// Nested document structure
$document = new Document([
    'title' => 'Inception',
    'director' => [
        'name' => 'Christopher Nolan',
        'birth_year' => 1970
    ],
    'cast' => [
        ['name' => 'Leonardo DiCaprio', 'role' => 'Dom Cobb'],
        ['name' => 'Marion Cotillard', 'role' => 'Mal']
    ],
    'metadata' => [
        'runtime' => 148,
        'budget' => 160000000,
        'box_office' => 836800000
    ]
]);

Collecting an Index

To work with documents in an Index, you first need to “collect” it:

// Basic collection
$movies = $sigmie->collect('movies');
 
// Collection with refresh for immediate availability
$movies = $sigmie->collect('movies', refresh: true);

The refresh: true parameter makes documents immediately searchable, which is useful for testing but should be avoided in production.

Danger

Using refresh: true is NOT recommended in production code: forcing a refresh on every write is expensive and reduces indexing throughput.

Adding Documents

Adding Single Documents

$document = new Document(['name' => 'Mickey Mouse']);
$movies = $sigmie->collect('movies');
 
$movies->add($document);

Adding Multiple Documents

$documents = [
    new Document(['name' => 'Snow White']),
    new Document(['name' => 'Cinderella']),
    new Document(['name' => 'Sleeping Beauty'])
];
 
$movies = $sigmie->collect('movies', refresh: true);
$movies->merge($documents);

Bulk Operations

For better performance with large datasets:

$documents = [];
for ($i = 0; $i < 1000; $i++) {
    $documents[] = new Document([
        'title' => "Movie {$i}",
        'year' => rand(1950, 2024),
        'rating' => rand(1, 10)
    ]);
}
 
$movies = $sigmie->collect('movies');
$movies->merge($documents);  // Bulk insert
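If a dataset is too large to send in a single call, one option is to split it into fixed-size batches first. The following is a minimal sketch in plain PHP, not part of Sigmie: sendInBatches() and the batch size of 300 are hypothetical, and the callback stands in for a call such as $movies->merge($batch).

```php
<?php

/**
 * Split a list into fixed-size batches and hand each batch to a callback.
 *
 * @param array<int, mixed> $items     Items to index (Document instances in practice).
 * @param int               $batchSize Hypothetical batch size; tune it for your cluster.
 * @param callable          $sink      Receives one batch (an array) at a time.
 * @return int Number of batches handed to the sink.
 */
function sendInBatches(array $items, int $batchSize, callable $sink): int
{
    if ($batchSize < 1) {
        throw new InvalidArgumentException('Batch size must be at least 1.');
    }

    $batches = 0;
    foreach (array_chunk($items, $batchSize) as $batch) {
        $sink($batch);
        $batches++;
    }

    return $batches;
}

// Stand-in data; with Sigmie the sink could be fn (array $batch) => $movies->merge($batch).
$items = range(1, 1000);
$sizes = [];
$batchCount = sendInBatches($items, 300, function (array $batch) use (&$sizes): void {
    $sizes[] = count($batch);
});
```

Here the 1000 items are delivered as three batches of 300 and a final batch of 100.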

Document Validation with Properties

When using properties, documents are automatically validated:

use Sigmie\Mappings\NewProperties;
 
$properties = new NewProperties;
$properties->name('title');
$properties->date('release_date');
$properties->number('rating')->float();
 
// Valid document
$validDoc = new Document([
    'title' => 'The Matrix',
    'release_date' => '1999-03-31T00:00:00Z',
    'rating' => 8.7
]);
 
// Invalid document (will be caught during indexing)
$invalidDoc = new Document([
    'title' => 'Invalid Movie',
    'release_date' => 'not-a-date',  // Invalid date format
    'rating' => 'not-a-number'       // Invalid rating
]);
 
$movies = $sigmie->collect('movies')
    ->properties($properties)
    ->merge([$validDoc, $invalidDoc]);  // Validation occurs here
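To catch bad values before they reach the index, you could run a pre-check on the raw field arrays. This is a sketch in plain PHP under stated assumptions: isIsoDateTime() and findInvalidFields() are hypothetical helpers, not Sigmie functions, and they only approximate what a date and a number mapping would accept.

```php
<?php

/** True when $value looks like an ISO 8601 date-time, e.g. 1999-03-31T00:00:00Z. */
function isIsoDateTime(mixed $value): bool
{
    if (!is_string($value)) {
        return false;
    }

    $pattern = '/^(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})(\.\d+)?(Z|[+-]\d{2}:\d{2})$/';
    if (preg_match($pattern, $value, $m) !== 1) {
        return false;
    }

    // checkdate() rejects impossible calendar dates such as 2023-02-30.
    return checkdate((int) $m[2], (int) $m[3], (int) $m[1])
        && (int) $m[4] < 24 && (int) $m[5] < 60 && (int) $m[6] < 60;
}

/**
 * Hypothetical pre-check: list the fields that would not fit a date or number mapping.
 *
 * @param array<string, mixed> $fields
 * @return string[] Names of the fields that failed the check.
 */
function findInvalidFields(array $fields): array
{
    $invalid = [];

    if (!isIsoDateTime($fields['release_date'] ?? null)) {
        $invalid[] = 'release_date';
    }

    $rating = $fields['rating'] ?? null;
    if (!is_int($rating) && !is_float($rating)) {
        $invalid[] = 'rating';
    }

    return $invalid;
}

$validResult = findInvalidFields([
    'title' => 'The Matrix',
    'release_date' => '1999-03-31T00:00:00Z',
    'rating' => 8.7,
]);

$invalidResult = findInvalidFields([
    'title' => 'Invalid Movie',
    'release_date' => 'not-a-date',
    'rating' => 'not-a-number',
]);
```

The first call returns an empty array; the second returns the names of both offending fields.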

Indexing Timing

Async Indexing (Default)

By default, Elasticsearch operates in “near real-time” mode:

$sigmie->newIndex('movies')->create();
 
$doc = new Document(['name' => 'Snow White']);
$movies = $sigmie->collect('movies');
$movies->add($doc);
 
$movies->count(); // Likely 0: the document is not searchable until the next refresh

With Elasticsearch's default refresh interval of one second, documents usually become searchable about a second after they are indexed.
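When a script or test needs to wait for a document to become searchable without forcing a refresh, polling with a timeout is one approach. The helper below is a plain-PHP sketch: waitUntil() is a hypothetical name, and the condition would typically wrap a check such as fn () => $movies->count() >= 1.

```php
<?php

/**
 * Poll a condition until it returns true or the timeout elapses.
 *
 * @param callable $condition  Returns true once the awaited state is reached.
 * @param int      $timeoutMs  Maximum time to wait, in milliseconds.
 * @param int      $intervalMs Pause between checks, in milliseconds.
 */
function waitUntil(callable $condition, int $timeoutMs = 2000, int $intervalMs = 100): bool
{
    $deadline = microtime(true) + $timeoutMs / 1000;

    do {
        if ($condition()) {
            return true;
        }
        usleep($intervalMs * 1000);
    } while (microtime(true) < $deadline);

    // One last check after the final pause.
    return (bool) $condition();
}

// Stand-in condition that becomes true on the third check.
$calls = 0;
$ready = waitUntil(function () use (&$calls): bool {
    $calls++;
    return $calls >= 3;
}, timeoutMs: 1000, intervalMs: 10);
```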

Sync Indexing (Testing)

For testing or when you need immediate availability:

$doc = new Document(['name' => 'Snow White']);
$movies = $sigmie->collect('movies', refresh: true);
$movies->add($doc);
 
$movies->count(); // 1 - document immediately available

Working with Collections

Counting Documents

$movies = $sigmie->collect('movies', refresh: true);
$totalMovies = $movies->count();

Checking Collection State

$movies = $sigmie->collect('movies');
 
// Check if collection is "alive" (has real-time data).
// AliveCollection must be imported with a use statement for this check to resolve.
if ($movies instanceof AliveCollection) {
    // Real-time collection with refresh enabled
    $count = $movies->count();
}

Iterating Through Documents

$movies = $sigmie->collect('movies', refresh: true);
 
// Add some documents first
$movies->merge([
    new Document(['title' => 'Movie 1']),
    new Document(['title' => 'Movie 2']),
    new Document(['title' => 'Movie 3'])
]);
 
// Lazy iteration (memory efficient for large collections)
$movies->each(function (Document $document) {
    echo $document['title'] . "\n";
});

Converting to Array

$movies = $sigmie->collect('movies', refresh: true);
$movies->merge([/* documents */]);
 
// Get all documents as array
$documentsArray = $movies->toArray();

Getting Random Documents

You can retrieve random documents from a collection using the random() method:

$movies = $sigmie->collect('movies');
 
// Get 10 random documents (returns a collection)
$randomMovies = $movies->random(10);
 
// Get a single random document
$randomMovie = $movies->random(1);
 
// Convert random documents to array
$randomArray = $movies->random(5)->toArray();

This is useful for:

Displaying sample data in your UI
Testing and development
Creating recommendation systems
Generating preview content

Document Operations

Updating Documents

To update documents, you typically re-index them with the same ID:

// Original document
$original = new Document([
    'title' => 'The Matrix',
    'year' => 1999
], 'matrix_1');
 
$movies = $sigmie->collect('movies', refresh: true);
$movies->add($original);
 
// Updated document (same ID)
$updated = new Document([
    'title' => 'The Matrix',
    'year' => 1999,
    'rating' => 8.7,  // New field
    'updated_at' => date('c')
], 'matrix_1');
 
$movies->add($updated);  // This will update the existing document
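Because the updated document above repeats every existing field, it can help to build the new payload from the old one. The sketch below assumes that re-indexing under the same ID replaces the stored document as a whole, so fields left out would be lost; withChanges() is a hypothetical helper in plain PHP.

```php
<?php

/**
 * Build the full payload for a re-index by applying changes to the current fields.
 *
 * @param array<string, mixed> $current Fields of the document as currently stored.
 * @param array<string, mixed> $changes New or changed fields.
 * @return array<string, mixed>
 */
function withChanges(array $current, array $changes): array
{
    // For string keys, array_merge lets values from $changes override $current.
    return array_merge($current, $changes);
}

$current = ['title' => 'The Matrix', 'year' => 1999];
$updated = withChanges($current, ['rating' => 8.7]);
```

The resulting array can then be passed to new Document($updated, 'matrix_1') as in the example above.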

Deleting Documents

Currently, document deletion is handled through Elasticsearch’s native APIs or by reindexing without the unwanted documents.

Working with Complex Data Types

Date Fields

$properties = new NewProperties;
$properties->date('created_at');
 
$document = new Document([
    'title' => 'New Movie',
    'created_at' => '2023-04-07T12:38:29.000000Z'  // ISO 8601 format
]);
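A date string in this shape can be produced with PHP's built-in date classes. The sketch below uses a fixed UTC instant so the output is reproducible; the variable names are illustrative.

```php
<?php

// Fixed instant in UTC so the output is reproducible.
$createdAt = new DateTimeImmutable('2023-04-07 12:38:29', new DateTimeZone('UTC'));

// Microsecond precision with a literal "Z" suffix.
// The literal "Z" is only correct because the instant is in UTC.
$withMicros = $createdAt->format('Y-m-d\TH:i:s.u\Z');

// DATE_ATOM is PHP's built-in ISO 8601-compatible format with a numeric offset.
$atom = $createdAt->format(DATE_ATOM);
```

$withMicros is '2023-04-07T12:38:29.000000Z' and $atom is '2023-04-07T12:38:29+00:00'.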

Geo Points

$properties = new NewProperties;
$properties->geoPoint('location');
 
$document = new Document([
    'venue' => 'Cinema Downtown',
    'location' => [
        'lat' => 40.7128,
        'lon' => -74.0060
    ]
]);
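Latitude must lie between -90 and 90 and longitude between -180 and 180, so a range check before indexing can catch swapped or malformed coordinates. isValidGeoPoint() below is a hypothetical plain-PHP helper, not a Sigmie function.

```php
<?php

/**
 * Check that a point has numeric lat/lon values within the valid ranges.
 *
 * @param array<string, mixed> $point Expected shape: ['lat' => float, 'lon' => float].
 */
function isValidGeoPoint(array $point): bool
{
    $lat = $point['lat'] ?? null;
    $lon = $point['lon'] ?? null;

    // Accept only real numbers, not numeric strings.
    if (!is_int($lat) && !is_float($lat)) {
        return false;
    }
    if (!is_int($lon) && !is_float($lon)) {
        return false;
    }

    return $lat >= -90 && $lat <= 90 && $lon >= -180 && $lon <= 180;
}

$newYork = isValidGeoPoint(['lat' => 40.7128, 'lon' => -74.0060]);   // true
$swapped = isValidGeoPoint(['lat' => -174.0060, 'lon' => 40.7128]);  // false: latitude out of range
$strings = isValidGeoPoint(['lat' => '40.7128', 'lon' => '-74.0']);  // false: not numeric types
```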

Nested Objects

$properties = new NewProperties;
$properties->nested('cast', function (NewProperties $props) {
    $props->name('actor');
    $props->keyword('role');
});
 
$document = new Document([
    'title' => 'Avengers',
    'cast' => [
        ['actor' => 'Robert Downey Jr.', 'role' => 'Iron Man'],
        ['actor' => 'Chris Evans', 'role' => 'Captain America']
    ]
]);

Performance Considerations

Batch Operations

Always prefer batch operations for multiple documents:

// Good: Batch operation
$movies->merge($manyDocuments);
 
// Avoid: Individual operations
foreach ($manyDocuments as $doc) {
    $movies->add($doc);  // Inefficient for large datasets
}

Memory Management

For large collections, use lazy iteration:

// Memory efficient for large datasets
$movies->each(function (Document $doc) {
    // Process each document
    processDocument($doc);
});
 
// Memory intensive for large datasets
$allDocs = $movies->toArray();  // Loads everything into memory

Index Optimization

Consider refresh strategies based on your use case:

// Production: Let Elasticsearch handle refresh timing
$movies = $sigmie->collect('movies');
 
// Development/Testing: Force immediate refresh
$movies = $sigmie->collect('movies', refresh: true);
 
// Batch processing: Disable refresh during bulk operations
$movies = $sigmie->collect('movies', refresh: false);
// ... add many documents ...
// Manually refresh when done
$sigmie->index('movies')->refresh();

Common Patterns

E-commerce Products

$properties = new NewProperties;
$properties->name('name');
$properties->longText('description');
$properties->price('price');
$properties->category('category');
$properties->tags('tags');
$properties->bool('in_stock');
$properties->date('created_at');
 
$product = new Document([
    'name' => 'Wireless Headphones',
    'description' => 'High-quality wireless headphones with noise cancellation',
    'price' => 199.99,
    'category' => 'Electronics',
    'tags' => ['audio', 'wireless', 'noise-cancelling'],
    'in_stock' => true,
    'created_at' => date('c')
]);
 
$products = $sigmie->collect('products')
    ->properties($properties)
    ->merge([$product]);

User Profiles

$properties = new NewProperties;
$properties->name('username');
$properties->email('email');
$properties->number('age')->integer();
$properties->tags('interests');
$properties->nested('address', function (NewProperties $props) {
    $props->keyword('street');
    $props->keyword('city');
    $props->keyword('country');
});
 
$user = new Document([
    'username' => 'john_doe',
    'email' => 'john_doe@example.com',
    'age' => 30,
    'interests' => ['technology', 'sports', 'travel'],
    'address' => [
        'street' => '123 Main St',
        'city' => 'New York',
        'country' => 'USA'
    ]
]);

Content Management

$properties = new NewProperties;
$properties->title('title');
$properties->longText('content');
$properties->name('author');
$properties->tags('tags');
$properties->category('category');
$properties->date('published_at');
$properties->bool('is_published');
 
$article = new Document([
    'title' => 'Getting Started with Elasticsearch',
    'content' => 'Elasticsearch is a powerful search engine...',
    'author' => 'Jane Smith',
    'tags' => ['elasticsearch', 'search', 'tutorial'],
    'category' => 'Technology',
    'published_at' => '2024-01-15T10:00:00Z',
    'is_published' => true
]);

Error Handling

try {
    $movies = $sigmie->collect('movies', refresh: true);
    $movies->merge($documents);
 
    echo "Indexed " . count($documents) . " documents successfully";
} catch (Exception $e) {
    echo "Error indexing documents: " . $e->getMessage();
}

Best Practices

Use Batch Operations: Always prefer merge() over individual add() calls for multiple documents
Validate Data: Use properties to validate document structure
Handle Dates Properly: Use ISO 8601 format for date fields
Memory Management: Use lazy iteration for large datasets
Error Handling: Always wrap operations in try-catch blocks
Production Refresh: Avoid refresh: true in production environments
Custom IDs: Use meaningful document IDs when you need to update specific documents

// Good pattern
$properties = new NewProperties;
$properties->name('title');
$properties->date('created_at');
 
try {
    $movies = $sigmie->collect('movies')
        ->properties($properties)
        ->merge($validatedDocuments);
 
    echo "Successfully indexed documents";
} catch (Exception $e) {
    // logger() is Laravel's logging helper; substitute your own logger elsewhere.
    logger()->error("Document indexing failed: " . $e->getMessage());
}
