Index Management | Documentation

The Index Management module handles Elasticsearch index metadata retrieval, cluster health monitoring, and automated index lifecycle management. It ensures system stability by actively monitoring disk space and executing automated cleanup policies when critical thresholds are reached.

Elasticsearch Integration

This component interfaces directly with the underlying Elasticsearch cluster to manage index states, retrieve mapping properties, and execute protective disk-space measures.

Architecture & Workflow

The automated cleanup subsystem runs as a scheduled background task that re-evaluates the cluster's disk usage every 60 seconds. When disk space becomes constrained, it escalates through warning and critical states to prevent cluster failure.

Index Metadata Retrieval

The system provides robust APIs for querying index structures and metadata, supporting pagination, sorting, and pattern matching.

Fetching Indices

The getAllIndexes routine retrieves a paginated list of indices from the cluster. It includes a safety mechanism to optionally filter out Elasticsearch internal system indices.

System indices (which begin with a . prefix, such as .kibana or .security) are excluded from the results by default unless the includeSystemIndex parameter is explicitly set to true.
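A minimal sketch of the default filtering rule, assuming index names are available as plain strings. The class and method names (SystemIndexFilter, isSystemIndex, filter) are hypothetical and not taken from the module; only the . prefix rule and the includeSystemIndex flag come from the description above.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper illustrating the system-index rule; not the module's actual code.
final class SystemIndexFilter {

    private SystemIndexFilter() {
    }

    // An index is treated as a system index when its name starts with a dot.
    static boolean isSystemIndex(String indexName) {
        return indexName.startsWith(".");
    }

    // Returns every name when includeSystemIndex is true; otherwise drops dot-prefixed names.
    static List<String> filter(List<String> indexNames, boolean includeSystemIndex) {
        if (includeSystemIndex) {
            return indexNames;
        }
        return indexNames.stream()
                .filter(name -> !isSystemIndex(name))
                .collect(Collectors.toList());
    }
}
```

Calling filter(names, false) on a list containing .kibana and log-2024.01.01 would keep only the log index, which mirrors the default behavior described for getAllIndexes.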

Retrieving Index Properties

To understand the schema of a given index, the getIndexProperties method extracts all field mappings. It returns a list of IndexPropertyType objects, detailing the name and data type of each field within the specified index pattern. If the requested index pattern does not exist, it safely returns an empty list rather than throwing an exception.
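The empty-list behavior can be illustrated with a small sketch. It assumes an in-memory map standing in for the cluster's mapping response, and a simplified IndexPropertyType with only a name and a type; the real class and the real lookup against Elasticsearch may look different.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Simplified stand-in for the module's IndexPropertyType; the real class may differ.
final class IndexPropertyType {
    private final String name;
    private final String type;

    IndexPropertyType(String name, String type) {
        this.name = name;
        this.type = type;
    }

    String getName() {
        return name;
    }

    String getType() {
        return type;
    }
}

// Hypothetical lookup over an in-memory map (pattern -> field name -> data type).
final class IndexPropertiesSketch {

    private final Map<String, Map<String, String>> mappingsByPattern;

    IndexPropertiesSketch(Map<String, Map<String, String>> mappingsByPattern) {
        this.mappingsByPattern = mappingsByPattern;
    }

    // Returns one entry per mapped field, or an empty list when the pattern is unknown.
    List<IndexPropertyType> getIndexProperties(String indexPattern) {
        Map<String, String> fields = mappingsByPattern.get(indexPattern);
        if (fields == null) {
            return Collections.emptyList();
        }
        List<IndexPropertyType> properties = new ArrayList<>();
        fields.forEach((name, type) -> properties.add(new IndexPropertyType(name, type)));
        return properties;
    }
}
```

Returning an empty list lets callers iterate over the result without a null check or a try-catch around a missing pattern.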

Automated Storage Protection

To prevent Elasticsearch from crashing due to disk exhaustion (which can lead to corrupted shards or a read-only cluster state), the preventSystemCrashBySpace scheduled task runs every 60 seconds.

Disk Usage Thresholds

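The system evaluates the diskUsedPercent metric from the cluster nodes and reacts according to the following thresholds:

Below 70%: safe. No action is taken.
70% to 84%: warning. An email alert is sent so that operators can take manual action.
85% and above: critical. Automated deletion of the oldest removable indices begins.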
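The threshold evaluation can be sketched as a small classifier. The 70% and 85% boundaries are the ones this page gives for the warning and critical states; the names DiskState and DiskThresholds are hypothetical, and treating fractional values such as 84.5% as a warning is an assumption.

```java
// Hypothetical names; only the 70% and 85% boundaries come from this page.
enum DiskState {
    SAFE, WARNING, CRITICAL
}

final class DiskThresholds {

    static final double WARNING_PERCENT = 70.0;
    static final double CRITICAL_PERCENT = 85.0;

    private DiskThresholds() {
    }

    // Maps a diskUsedPercent reading to the state the cleanup task would act on.
    static DiskState classify(double diskUsedPercent) {
        if (diskUsedPercent >= CRITICAL_PERCENT) {
            return DiskState.CRITICAL;
        }
        if (diskUsedPercent >= WARNING_PERCENT) {
            return DiskState.WARNING;
        }
        return DiskState.SAFE;
    }
}
```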

The warning email alert is throttled using the UtmSpaceNotificationControl registry. Once triggered, the system enforces a 24-hour cooldown period before sending another alert, preventing inbox flooding during sustained high-usage periods.
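The cooldown logic can be sketched with an in-memory timestamp. This is only an illustration: the module tracks this state through the UtmSpaceNotificationControl registry, whose API is not shown on this page, so the AlertCooldown class and its tryAcquire method are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical in-memory throttle; the module itself relies on UtmSpaceNotificationControl.
final class AlertCooldown {

    private static final Duration COOLDOWN = Duration.ofHours(24);

    // Time of the last alert that was actually sent; null until the first one.
    private Instant lastSent;

    // Returns true and records the send time if no alert went out in the last 24 hours.
    synchronized boolean tryAcquire(Instant now) {
        if (lastSent != null && now.isBefore(lastSent.plus(COOLDOWN))) {
            return false;
        }
        lastSent = now;
        return true;
    }
}
```

Passing the current time in as a parameter, rather than reading the clock inside the method, keeps the throttle easy to test with fixed timestamps.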

Automated Deletion Protocol

When disk usage reaches the critical 85% threshold, the system begins purging data to stabilize the cluster. This process is highly controlled to prevent accidental deletion of critical system data.

1. The system queries Elasticsearch for indices matching the configured log pattern (typically log-*). The results are sorted by CreationDate in ascending order, so the oldest data is targeted first.

2. Before any deletion occurs, the system checks indexPolicyService.isIndexRemovable(). If an index is locked, marked for legal hold, or otherwise protected, it is skipped.

3. The index is deleted from the cluster. An audit event of type ApplicationEventType.INFO is then created, recording the index name, creation date, document count, and storage size of the deleted index.

4. The thread sleeps for 10 seconds. This delay gives Elasticsearch time to process the deletion and update its internal disk usage metrics, which keeps the system from over-deleting data. After the delay, the cluster status is re-evaluated.

Automated deletion is permanent. The system will continue deleting the oldest removable indices one by one until the cluster's overall disk usage drops back below the 70% safe threshold. Ensure your retention policies and storage capacities are properly aligned.
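The deletion protocol can be sketched as a single loop. Everything here other than the 70% and 85% thresholds and the oldest-first ordering is an assumption made for illustration: the Cluster interface, the IndexInfo class, the isRemovable predicate (standing in for indexPolicyService.isIndexRemovable()), and the pauseMillis parameter (the module waits 10 seconds) are hypothetical, and the audit event is omitted.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical view of an index: only the fields this sketch needs.
final class IndexInfo {
    final String name;
    final Instant creationDate;

    IndexInfo(String name, Instant creationDate) {
        this.name = name;
        this.creationDate = creationDate;
    }
}

// Hypothetical abstraction over the Elasticsearch calls the protocol relies on.
interface Cluster {
    double diskUsedPercent();

    List<IndexInfo> logIndices();

    void delete(IndexInfo index);
}

final class CleanupSketch {

    static final double SAFE_PERCENT = 70.0;
    static final double CRITICAL_PERCENT = 85.0;

    private CleanupSketch() {
    }

    // Deletes the oldest removable indices until usage drops below the safe threshold.
    static List<String> purge(Cluster cluster, Predicate<IndexInfo> isRemovable, long pauseMillis) {
        List<String> deleted = new ArrayList<>();
        if (cluster.diskUsedPercent() < CRITICAL_PERCENT) {
            return deleted; // deletion only starts at the critical threshold
        }
        List<IndexInfo> oldestFirst = new ArrayList<>(cluster.logIndices());
        oldestFirst.sort(Comparator.comparing((IndexInfo index) -> index.creationDate));
        for (IndexInfo index : oldestFirst) {
            if (!isRemovable.test(index)) {
                continue; // protected indices are skipped
            }
            cluster.delete(index);
            deleted.add(index.name);
            try {
                Thread.sleep(pauseMillis); // let the cluster refresh its disk metrics
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            if (cluster.diskUsedPercent() < SAFE_PERCENT) {
                break; // cluster is back in the safe range
            }
        }
        return deleted;
    }
}
```

Re-reading disk usage after every deletion, rather than computing up front how many indices to remove, is what allows the loop to stop as soon as the cluster is back below 70%.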

Implementation Details

Sorting Conversion

When retrieving indices, the system translates application-level pagination sorting into Elasticsearch-native sorting using a custom builder:

private IndexSort from(Sort sort) {
    // No sort requested: return an unsorted IndexSort.
    if (Objects.isNull(sort) || sort.isUnsorted()) {
        return IndexSort.unSorted();
    }

    // Translate each application-level order into its Elasticsearch equivalent.
    IndexSort.Builder sortBuilder = IndexSort.builder();
    sort.forEach(order -> sortBuilder.with(
        IndexSortableProperty.fromJsonValue(order.getProperty()),
        order.getDirection().isAscending() ? SortOrder.Asc : SortOrder.Desc
    ));
    return sortBuilder.build();
}

Error Handling

All Elasticsearch operations are wrapped in try-catch blocks. If the deletion protocol fails for a specific index (e.g., due to a cluster timeout or permission issue), an ApplicationEventType.WARNING event is logged and the loop continues to the next available index, so a single failure does not halt the entire recovery process.
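The continue-on-failure behavior can be sketched as follows. The ResilientDeletion class, the deleter callback, and the warnings list are hypothetical; the list stands in for logging an ApplicationEventType.WARNING event, and catching RuntimeException is an assumption about which failures the module handles.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical illustration of per-index error isolation; not the module's actual code.
final class ResilientDeletion {

    private ResilientDeletion() {
    }

    // Tries each index in order; a failure is recorded as a warning and the loop moves on.
    static List<String> deleteAll(List<String> indexNames, Consumer<String> deleter, List<String> warnings) {
        List<String> deleted = new ArrayList<>();
        for (String name : indexNames) {
            try {
                deleter.accept(name);
                deleted.add(name);
            } catch (RuntimeException e) {
                // Stand-in for logging an ApplicationEventType.WARNING event.
                warnings.add("Failed to delete " + name + ": " + e.getMessage());
            }
        }
        return deleted;
    }
}
```

Placing the try-catch inside the loop body, not around the whole loop, is what keeps one failing index from aborting the remaining deletions.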

Troubleshooting

If the cleanup is deleting recent indices, your cluster is likely undersized: once the oldest indices are gone, the process moves on to newer ones until disk usage falls below 70%. To stop this, either increase your storage capacity or reduce your data ingestion rate.

To protect an index from automated deletion, keep in mind that the process only targets indices matching the system log pattern (e.g., log-*) and skips any index that indexPolicyService.isIndexRemovable() reports as not removable. Either make sure the index does not match the target pattern, or update its policy state to non-removable.

Receiving warning emails while no indices are being deleted is expected behavior when disk usage is between 70% and 84%. The system is prompting you to take manual action; automated deletion begins only once usage reaches or exceeds 85%.