Posts by Elastic

AI adoption in security: Top use cases and mistakes to avoid

Widespread implementation of artificial intelligence (AI) in security presents a paradox. On one hand, it helps security experts combat advanced threats at scale. On the other hand, AI is also contributing to the scale and sophistication of adversaries' threat campaigns.

To fight fire with fire, organizations are increasingly automating security processes to make up for the uneven playing field on which they find themselves. In this landscape, AI in cybersecurity is necessary to move from reactive defenses to proactive protection. However, AI adoption isn’t without its challenges and considerations.

This article explores how AI is transforming security operations, the top-value use cases it’s delivering, and key mistakes to avoid when bringing AI into your security operations center (SOC).

Elastic Cloud Hosted achieves FedRAMP® High "In Process" status

We’re excited to announce that Elastic has achieved FedRAMP® High “In Process” status for Elastic Cloud Hosted on AWS GovCloud (US). This designation from the US Federal Risk and Authorization Management Program (FedRAMP) Program Management Office builds on Elastic’s continued commitment to provide the US federal government with secure, compliant, and transparently priced technology solutions.

This milestone comes on the heels of Elastic and GSA’s June announcement of a volume-based discount buying program for US federal agencies. The program streamlines the procurement process, builds efficiencies of scale, and shortens time to value.

1 https://www.fedramp.gov/understanding-baselines-and-impact-levels/

APM best practices: Dos and don’ts guide for practitioners

Application performance management (APM) is the practice of regularly tracking, measuring, and analyzing the performance and availability of software applications. APM helps you get visibility into complex microservices environments, which can overwhelm site reliability engineering (SRE) teams. The resulting insights help teams deliver an optimal user experience and achieve desired business outcomes. It’s a complex process, but the goal is straightforward: ensuring that an application runs smoothly and meets the expectations of users and businesses.

A clear understanding of an application's operation and a proactive APM practice are crucial for maintaining high-performing software applications. APM shouldn’t be an afterthought. It should be considered from the beginning. When implemented proactively, it can be incorporated into how software runs by embedding monitoring components directly into the application.

# Auto-instrumentation handles this automatically
@app.route('/api/orders')
def create_order():
    # Add manual span only for critical business logic
    with tracer.start_as_current_span("order.validation") as span:
        span.set_attribute("order.value", order_total)
        if not validate_order(order_data):
            span.set_status(Status(StatusCode.ERROR))
            return "invalid order", 400

  • Do: Start with auto-instrumentation, then add manual spans for business-critical operations.

  • Don't: Manually instrument every function call — you'll create performance overhead and noise.

  • Pitfall: Over-instrumentation can add 15%–20% latency. Monitor your monitoring with baseline performance comparisons.

A few components for an organization or business to consider when developing an APM strategy are:

  • Performance monitoring, including evaluating latency, service level objectives, response time, throughput, and request volumes

  • Error tracking, including exceptions, crashes, and failed API calls 

  • Infrastructure monitoring, including health and resource usage of servers, containers, and cloud environments that support the application

  • User experience metrics, including load times, session performance, click paths, and browser or device details (It’s important to keep in mind that even if system metrics look fine, users may still encounter performance issues.)
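As a rough illustration of how these four signal types fit together, here is a minimal in-process collector sketch. The class and method names are hypothetical; a real APM agent such as Elastic APM or OpenTelemetry would collect these signals for you.

```python
class ApmCollector:
    """Toy illustration of the four APM signal types (not a real APM client)."""

    def __init__(self):
        self.latencies_ms = []   # performance monitoring
        self.errors = []         # error tracking
        self.host_metrics = {}   # infrastructure monitoring
        self.page_loads_ms = []  # user experience metrics

    def record_request(self, duration_ms, failed=False, exc=None):
        # Every request contributes a latency sample; failures also log an error
        self.latencies_ms.append(duration_ms)
        if failed:
            self.errors.append(exc or "unknown")

    def p95_latency(self):
        # Nearest-rank style p95 over all recorded samples
        data = sorted(self.latencies_ms)
        return data[int(0.95 * (len(data) - 1))]

    def error_rate(self):
        return len(self.errors) / max(len(self.latencies_ms), 1)
```

Even this toy version shows why the last bullet matters: system metrics (`host_metrics`) can look healthy while `page_loads_ms` tells a different story for real users.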

Key principles of effective APM

The core principles of effective application performance management are end-to-end visibility (from the user's browser to the database), real-time monitoring and insights, and contextual insights, with a user- and business-objective focus. APM can improve application scalability by enabling continuous improvements and increasing performance over time.

  • Do: Implement real-time dashboards with SLO-based alerts rather than arbitrary thresholds.

  • Don't: Rely only on periodic performance reviews or CPU/memory alerts — instrument user experience metrics.

  • Pitfall: Alert fatigue from low-level system metrics. Focus on user-facing SLOs that indicate real problems.

When creating an APM strategy, here are a few key principles to consider:

1. Proactive monitoring: Prevent issues before they impact users by setting up alerts and responding quickly to any anomalies. But try to avoid alert fatigue. Balance automated alerts with human oversight so important issues don’t get missed, focusing on outcomes rather than system metrics. 

2. Real-time insights: Move beyond logging issues and enable fast decision-making based on live data and real-time dashboards that prioritize the most critical business transactions. Use telemetry data (logs, metrics, and traces) to parse your performance insights.

3. End-to-end visibility: Monitor the application across the entire environment, the entire user flow, and all layers, from frontend to backend.

4. User-centric approach: Prioritize performance and experience from an end-user perspective, while considering key business objectives.

5. Real user monitoring: The work doesn’t stop when it’s in your user’s hands. By monitoring their experience, you can iterate and improve based on their feedback.

6. Continuous improvement: Use insights to optimize over time and regularly uncover and tackle unreported issues. Issues should be addressed dynamically rather than when discovered in periodic performance reviews. 

7. Context propagation: Ensure trace context flows through your entire request path, especially across service boundaries:

# Outgoing request - inject context
headers = {}
propagate.inject(headers)
response = requests.post('http://service-b/process', headers=headers)

8. Sampling strategy: Use intelligent sampling to balance visibility with performance:

  • 1%–10% head-based sampling for high-traffic services

  • 100% sampling for errors and slow requests using tail-based sampling

  • Monitor instrumentation overhead — aim for <5% performance impact

@RestController
public class OrderController {
    @PostMapping("/orders")
    public ResponseEntity createOrder(@RequestBody OrderRequest request) {
        // Auto-instrumentation captures this endpoint automatically
        // Add custom business context
        Span.current().setAttributes(Attributes.of(
            stringKey("order.value"), String.valueOf(request.getTotal()),
            stringKey("user.tier"), request.getUserTier()
        ));
        return ResponseEntity.ok(processOrder(request));
    }
}

  • Do: Implement sampling strategies and monitor instrumentation overhead in production.

  • Don't: Use 100% sampling for high-traffic services — you'll impact performance and explode storage costs.

  • Pitfall: Head-based sampling can miss critical error traces. Use tail-based sampling to capture all errors while reducing volume.
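To make the head-based vs. tail-based distinction concrete, here is a minimal sketch of both decision styles. This is illustrative only; real collectors (for example, the OpenTelemetry Collector) implement these as configurable sampling policies, and the function names here are invented.

```python
import hashlib

def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based sampling decided at trace start: deterministic per trace ID,
    so every service in the request path makes the same keep/drop decision."""
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < rate * 10_000

def tail_sample(spans: list[dict], base_rate: float = 0.05,
                slow_ms: float = 500.0) -> bool:
    """Tail-based sampling decided after the trace completes: always keep
    errors and slow traces, otherwise fall back to a head-style base rate."""
    if any(s.get("error") for s in spans):
        return True
    if any(s.get("duration_ms", 0) > slow_ms for s in spans):
        return True
    return head_sample(spans[0]["trace_id"], base_rate)
```

The tail-based path is what closes the pitfall above: an error trace is kept with certainty, regardless of how low the base rate is.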

Here’s how to get it right:

  • Select the right APM solution: The right APM tool should align with an application's architecture and the organization's needs. The solution should provide an organization with the tools and capabilities it needs to monitor, track, measure, and analyze its software applications. A business may use OpenTelemetry, an open source observability framework, to instrument and collect telemetry data (traces, metrics, and logs) from applications. 

  • Manage cardinality to control costs: High-cardinality attributes can make metrics unusable and expensive:
# Good - bounded cardinality
span.set_attribute("user.tier", user.subscription_tier)       # 3-5 values
span.set_attribute("http.status_code", response.status_code)  # ~10 values

# Bad - unbounded cardinality
span.set_attribute("user.id", user.id)          # Millions of values
span.set_attribute("request.timestamp", now())  # Infinite values

  • Set up intelligent alerting based on SLOs rather than arbitrary thresholds. Use error budgets to determine when to page someone:
slos:
  - name: checkout_availability
    target: 99.9%
    window: 7d
  - name: checkout_latency
    target: 95%  # 95% of requests under 500ms
    window: 7d
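An SLO target like 99.9% availability implies an error budget, and deciding when to page is then simple arithmetic. A minimal sketch with hypothetical helper names:

```python
def error_budget_remaining(target: float, total: int, failed: int) -> float:
    """Fraction of the error budget left in the window; negative means the
    SLO is already blown."""
    budget = (1.0 - target) * total  # allowed failures in the window
    return (budget - failed) / budget if budget else 0.0

def burn_rate(target: float, total: int, failed: int) -> float:
    """How fast the budget is burning: 1.0 means exactly on-budget pace,
    above 1.0 means the window will end in violation."""
    allowed = 1.0 - target
    actual = failed / total if total else 0.0
    return actual / allowed if allowed else 0.0
```

For a 99.9% target over 1M requests, 500 failures leaves half the budget; a sustained burn rate well above 1.0 is the signal worth paging on, not a single CPU spike.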

  • Train teams and promote collaboration. An APM strategy impacts a wide range of stakeholders, not just developers. Be sure to involve IT teams and other business stakeholders in cross-departmental collaboration. Work together by implementing APM into your organizational setup. Make sure to establish clear goals and KPIs that align with business needs and consider user experience. 

  • Review and evaluate. An APM strategy continues to evolve and change alongside application and business needs.
order_processing_duration = Histogram(
    "order_processing_seconds",
    "Time to process orders",
    ["payment_method", "order_size"]
)

with order_processing_duration.labels(
    payment_method=payment.method,
    order_size=get_size_bucket(order.total)
).time():
    process_order(order)

  • Synthetic monitoring: Simulates user interactions to detect issues before real users are affected. Critical for external dependencies:
// Synthetic check for critical user flow
const syntheticCheck = async () => {
  const span = tracer.startSpan('synthetic.checkout_flow');
  try {
    await loginUser();
    await addItemToCart();
    await completePurchase();
    span.setStatus({code: SpanStatusCode.OK});
  } catch (error) {
    span.recordException(error);
    span.setStatus({code: SpanStatusCode.ERROR});
    throw error;
  } finally {
    span.end();
  }
};

  • Deep-dive diagnostics and profiling: Helps troubleshoot complex performance bottlenecks, which could include third-party plugins or tools. Through application profiling, you can go deeper into your data and analyze how it is performing according to its functions.

  • Distributed tracing: Essential for microservices architectures. Handle context propagation carefully across async boundaries:
# Event-driven systems - propagate context through messages
def publish_order_event(order_data):
    headers = {}
    propagate.inject(headers)
    message = {
        'data': order_data,
        'trace_headers': headers  # Preserve trace context
    }
    kafka_producer.send('order-events', message)

APM data analysis and insights

Monitoring and gathering data is just the beginning. Businesses need to understand how to interpret application performance management data for tuning and decision-making.

Identifying trends and patterns helps teams proactively detect issues. Use correlation analysis to link user complaints with backend performance. See an example here using ES|QL (Elastic’s query language):

FROM traces-apm*
| WHERE user.id == "user_12345"
    AND @timestamp >= "2024-06-06T09:00:00"
    AND @timestamp <= "2024-06-06T10:00:00"
| EVAL duration_ms = transaction.duration.us / 1000
| KEEP trace.id, duration_ms, transaction.name, service.name, transaction.result
| WHERE duration_ms > 2000
| SORT duration_ms DESC
| LIMIT 10

Detecting bottlenecks: APM reveals common performance anti-patterns, such as the N+1 query problem shown in the code below. Use APM to optimize the code:

# N+1 query problem detected by APM
def get_user_orders_slow(user_id):
    user = User.query.get(user_id)
    orders = []
    for order_id in user.order_ids:  # Each iteration = 1 DB query
        orders.append(Order.query.get(order_id))
    return orders

# Optimized after APM analysis
def get_user_orders_fast(user_id):
    return Order.query.filter(Order.user_id == user_id).all()  # Single query

Correlating metrics and linking user complaints with backend performance data, including historical data, reveals how different parts of the system interact. This can help teams accurately diagnose root causes and understand the full impact of performance issues.
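As a toy example of such correlation, a Pearson coefficient over hourly buckets of complaint counts and backend p95 latency already surfaces the relationship. The data below is invented for illustration:

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient over two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hourly buckets: user complaint counts vs. backend p95 latency (ms)
complaints = [0, 1, 0, 2, 8, 12, 3, 1]
p95_latency = [180, 200, 190, 240, 900, 1300, 310, 210]
```

A coefficient near 1.0 here suggests the complaints track backend latency, which narrows the investigation; the Do/Don't items below still apply, since correlation alone does not identify the cause.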

Automating root cause analysis with AI/machine learning-driven AIOps tooling helps accelerate diagnostics and resolution by pinpointing the source of problems, reducing downtime, and freeing up resources.

It’s important to use a holistic picture of your data to inform future decisions. The more data you have, the more you can leverage.

  • Do: Use distributed traces to identify the specific service and operation causing slowdowns.

  • Don't: Assume correlation means causation — verify with code-level profiling data.

  • Pitfall: Legacy systems often appear as black boxes in traces. Use log correlation and synthetic spans to maintain visibility.

// Java - Auto-propagation with Spring Cloud
@PostMapping("/orders")
public ResponseEntity createOrder(@RequestBody OrderRequest request) {
    Span.current().setAttributes(Attributes.of(
        stringKey("order.type"), request.getOrderType(),
        longKey("order.value"), request.getTotalValue()));
    // OpenFeign automatically propagates context to downstream services
    return paymentClient.processPayment(request.getPaymentData());
}

// Go - Manual context extraction and propagation
func processHandler(w http.ResponseWriter, r *http.Request) {
    ctx := otel.GetTextMapPropagator().Extract(r.Context(),
        propagation.HeaderCarrier(r.Header))
    ctx, span := tracer.Start(ctx, "process_payment")
    defer span.End()
    // Continue with trace context maintained
}

Legacy system integration: Create observability bridges for systems that can't be directly instrumented:

# Synthetic spans with correlation IDs for mainframe calls
with tracer.start_as_current_span("mainframe.account_lookup") as span:
    correlation_id = format(span.get_span_context().trace_id, '032x')
    logger.info("CICS call started", extra={
        "correlation_id": correlation_id,
        "trace_id": span.get_span_context().trace_id
    })
    result = call_mainframe_service(account_data, correlation_id)
    span.set_attribute("account.status", result.status)

Advanced trace analysis with ES|QL: Link user complaints to backend performance using Elastic's query language:

-- Find slow requests during complaint timeframe
FROM traces-apm*
| WHERE user.id == "user_12345" AND @timestamp >= "2024-06-06T09:00:00"
| EVAL duration_ms = transaction.duration.us / 1000
| WHERE duration_ms > 2000
| STATS avg_duration = AVG(duration_ms) BY service.name, transaction.name
| SORT avg_duration DESC

-- Correlate errors across service boundaries
FROM traces-apm*
| WHERE trace.id == "44b3c2c06e15d444a770b87daab45c0a"
| EVAL is_error = CASE(transaction.result == "error", 1, 0)
| STATS error_rate = SUM(is_error) / COUNT(*) * 100 BY service.name
| WHERE error_rate > 0

Event-driven architecture patterns: Explicitly propagate context through message headers for async processing:

# Producer - inject context into message
headers = {}
propagate.inject(headers)
message = {
    'data': order_data,
    'trace_headers': headers  # Preserve trace context
}
await kafka_producer.send('order-events', message)

# Consumer - extract and continue trace
trace_headers = message.get('trace_headers', {})
context = propagate.extract(trace_headers)
with tracer.start_as_current_span("order.process", context=context):
    await process_order(message['data'])

  • Do: Use ES|QL for complex trace analysis that traditional dashboards can't handle.

  • Don't: Try to instrument legacy systems directly — use correlation IDs and synthetic spans.

  • Pitfall: Message queues and async processing break trace context unless explicitly propagated through headers.

  • Key insight: Perfect instrumentation isn't always possible. Strategic use of correlation IDs, synthetic spans, and intelligent querying provides comprehensive observability even in complex, hybrid environments.

SOC analyst vs. security analyst: What’s the difference?

A security operations center (SOC) analyst enhances your security posture by defending the organization against cybersecurity threats. Responsible for monitoring, detecting, investigating, and responding to cyber threats, the SOC analyst is the first line of defense in keeping the organization’s IT ecosystem secure when an incident arises. 

A security analyst, similar to a SOC analyst, is responsible for proactive defense and security posture. However, security analysts tend to have a more strategic, preventive focus and may or may not work within the SOC.

With such critical responsibilities, what does it take to become a SOC analyst or security analyst? Let’s explore the job, required skills, and the career path of both.

Challenges SOC analysts face

With a job so rewarding and critical for an organization, it’s no surprise that SOC analysts face many challenges. 

1. Alert fatigue: SOC analysts are overwhelmed by the volume of alerts, including false positives, generated by security tools. All these alerts require attention, triage, and intervention, potentially leading SOC analysts to overlook critical threats. 

The potential solution: AI-driven security analytics significantly reduces the noise and prioritizes critical alerts, saving security analysts time and effort.

2. High stress levels and burnout: SOC analysts operate in a high-pressure environment, amid constant demands to respond to yet another threat. Then, there’s the added pressure of a dynamic threat landscape and the need to keep up with emerging and advanced threat actors, new vulnerabilities, and attack techniques. 

The potential solution: An AI Assistant can help security analysts gain quicker insights and analysis and respond to threats faster and more efficiently.

3. Fear of being replaced by AI: As SOC analysts begin to rely on AI to make their jobs easier, many question whether their jobs will become obsolete. An AI Assistant can already triage alerts and monitor networks for threats more effectively than a junior security analyst. What will happen tomorrow?

The potential solution: AI won’t replace SOC teams, but it will fundamentally transform the role of tier 1 SOC analysts. Analysts will be able to forget about time-consuming manual tasks and get AI help in elevating their skills, so they can focus on more rewarding investigations and threat hunting.
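The alert-prioritization idea from challenge 1 can be sketched as a simple scoring function. This is a toy scheme with made-up weights, not how Elastic's AI-driven analytics actually ranks alerts:

```python
# Hypothetical severity weights for illustration only
SEVERITY_WEIGHT = {"low": 1, "medium": 3, "high": 7, "critical": 10}

def triage_score(alert: dict) -> float:
    """Toy priority score combining severity, asset criticality, and the
    detection's confidence, so high-impact alerts surface first."""
    sev = SEVERITY_WEIGHT.get(alert["severity"], 1)
    return sev * alert.get("asset_criticality", 1) * alert.get("confidence", 0.5)

def prioritize(alerts: list[dict]) -> list[dict]:
    """Return alerts ordered most-urgent first for the analyst queue."""
    return sorted(alerts, key=triage_score, reverse=True)
```

Even this simple ranking shows the principle behind AI-assisted triage: analysts start from the top of a short, ordered queue instead of wading through every alert in arrival order.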

How AI and contextual search enhance defence cybersecurity

In today’s defence environment, information is abundant, yet insight often remains elusive. While data pours in from every connected system, every edge device, and every digital touchpoint, security teams still spend too much time stitching together fragmented inputs, hunting for signals, and navigating silos just to answer basic questions. 

In defence cybersecurity, every minute spent digging through disconnected security logs is a minute adversaries can exploit. Each missed correlation or delayed response undermines the confidence of leadership, increases risk, and erodes operational advantage. 

Today’s security operations teams are tasked with monitoring exponentially growing volumes of data across fragmented systems, often without the time, context, or personnel needed to turn information into action. As threats grow more sophisticated and move at machine speed, legacy search and analysis processes become a liability. Investigations take too long. Alerts go untriaged. And decisions are made on incomplete data, putting missions and teams at risk.

Security intelligence that’s battle-tested, not just boardroom-proven

Elastic's security capabilities received rigorous testing in NATO's Locked Shields exercise, one of the world's largest live-fire cybersecurity simulations. During the event, blue teams — defensive cybersecurity units — deployed a comprehensive security architecture integrating multiple data sources: OS event logs, PowerShell logs, firewall/IPS/IDS data, threat intelligence feeds, and endpoint detection and response capabilities. The environment mirrored real-world defence operations, with the Elastic Common Schema (ECS) normalising disparate data sources to streamline detection workflows. Security teams gained unified visibility across their entire digital estate through preconfigured dashboards that simplified complex analysis tasks.

Protection capabilities included malware and ransomware prevention, malicious behaviour analysis, memory threat protection, and credential hardening. All detection rules mapped to the MITRE ATT&CK framework,2 enabling teams to understand adversary tactics and techniques while measuring defensive coverage. The exercise also tested defensive resilience. Red teams — simulating sophisticated threat actors with advanced persistent capabilities — actively attempted to disable security tools. Features like agent tamper protection ensured monitoring remained intact even under direct attack — a critical capability in contested environments.

Enhanced monitoring of Amazon EKS with Elastic add-on capabilities

Amazon Elastic Kubernetes Service (EKS) makes running Kubernetes on AWS simple and scalable. But as your workloads grow, so does the need for robust monitoring and observability. Enter Elastic Agent, a powerful, unified way to collect logs, metrics, and security data from your EKS clusters, all managed through Elastic Fleet. In this blog, we’ll walk through how to set up Elastic Agent on EKS, highlight key considerations, and share some tips for getting the most out of your monitoring stack.


Once the Elastic Agent is deployed in a pod, it automatically enrolls with Fleet, Elastic’s centralized management system, using the specified configuration values. After enrollment, Fleet provides full control over the agent, including its health status, configuration of integrations, and data ingestion. This setup enables centralized observability and security by ingesting and analyzing data in Elasticsearch, with visualization and management provided through Kibana.

Step-by-Step: Deploying Elastic Agent on Amazon EKS

Let’s break down the process, based on Elastic’s official documentation:

agent:
  fleet:
    enabled: true
    url:
    token:

  • Apply the configuration and deploy the add-on to your EKS cluster.


Note: We recommend selecting the configuration override option.

Airtel is strengthening security operations with Elastic’s AI-driven analytics

In a previous blog post, we covered how the managed security services (MSS) of Airtel, a leading telecommunications provider, powered by Elastic Security, provide real-time threat detection, advanced analytics, and cloud security for enterprise customers. By using SIEM, endpoint protection, cloud security, and threat intelligence, Airtel enhances proactive threat hunting and incident response.

In this blog, we will explore AI-driven features of Elastic Security like AI Assistant, Attack Discovery, and onboarding of custom data with Automatic Import.

Elastic AI Assistant for Security: Elastic AI Assistant for Security enhances analyst efficiency by providing intelligent recommendations, automated threat hunting queries, and contextual insights. This reduces manual effort, accelerates triage, and empowers MSSPs to respond to incidents with greater precision.

Automatic Import: Automatic Import automates the development of custom data integrations with generative AI, cutting the effort needed to create and validate custom integrations from up to several days to less than 10 minutes and significantly lowering the learning curve for onboarding data.

GenAI-powered security features: Elastic Security’s GenAI features improve anomaly detection, behavioral analytics, and predictive threat modeling. With machine learning-driven insights, MSSPs can proactively mitigate risks before they result in full-scale attacks.

These capabilities enhance operational efficiency, reduce alert fatigue through automated prioritization, and ensure scalable, cost-effective security operations.

These features offer significant benefits to Airtel MSS, enhancing its ability to deliver comprehensive security solutions to its customers:

  1. Enhanced threat detection and response: Elastic's Attack Discovery uses AI-driven insights to identify and respond to threats more effectively. This capability allows Airtel to detect anomalies and potential security incidents quickly, reducing the mean time to detect (MTTD) and respond (MTTR) to threats.

  2. Search AI-powered insights: Elastic AI Assistant for Security provides Airtel with advanced capabilities to generate queries and visualizations, reducing the learning curve for security investigations. This tool helps analysts interactively explore problems and execute remedies using generative AI, which accelerates incident management and root cause analysis.

  3. Scalability and flexibility: Elastic's Search AI Platform is designed to handle large volumes of data, making it suitable for Airtel managing multiple clients with varying data needs. The platform's ability to ingest and analyze data from any source ensures that Airtel can provide tailored security solutions to its clients.

  4. Cost-efficiency: By consolidating multiple security tools into a single platform, Elastic helps MSSPs reduce operational costs. The unified data store eliminates the need for data rehydration, enabling long-term historical analysis and reducing storage costs.

  5. Improved collaboration and productivity: Elastic's solutions facilitate better collaboration between technical and business teams by providing a single pane of glass for security operations. This integration reduces manual troubleshooting processes and enhances productivity by automating routine tasks.

  6. Future-proofed security operations: With features like cross-cluster search and AI-driven anomaly detection, Elastic ensures that Airtel can adapt to evolving security challenges and regulatory requirements. The platform's open and extensible architecture supports seamless integration with existing technology ecosystems.

  7. Upskilling and empowerment: AI Assistant for Security helps upskill junior analysts by guiding them through detection, analysis, and remediation processes. This capability not only enhances resource efficiency but also contributes to the sustainable development of talent within Airtel's organization.

Elastic AI Assistant for Security and Attack Discovery are transforming how Airtel Secure SOC operates by drastically reducing alert fatigue and investigation timelines. Through contextual threat summarization and natural language interaction, analysts can triage and resolve alerts significantly faster.


  • Business growth enabled: 50% faster onboarding of new customers using AI-powered detection rules and prebuilt integration templates

  • Cost optimization: 25% lower operational cost per customer cluster due to Elastic’s horizontal scaling, pay-per-ingest pricing, and unified agent model

Elastic managed integrations for scalable, multi-tenant visibility

Airtel MSS uses over 100 Elastic-built integrations to expand the range of data sources of its customers. Airtel’s MSSP platform spans 30+ Elastic customer deployments, powering ingestion from diverse endpoints, firewalls, cloud services, and business systems.

  • Airtel manages multiple customer environments, ensuring data isolation and compliance.

  • Elastic’s cloud-native architecture scales dynamically, handling high-volume data ingestion without performance bottlenecks.

  • Onboarding automation engine: One-click deployment and agent assignment

  • Role-based access control (RBAC) for per-customer data and dashboard segregation

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.

In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use. 

Elastic, Elasticsearch, and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.

Congratulations to our first Elastic Verified Generative AI Sales Partners

The tech industry is bursting with new tools to help teams build production-ready AI applications without requiring advanced technical knowledge. But even so, many businesses still struggle to move beyond AI pilots to scalable, secure solutions that deliver real business value. The complexity of integrating multiple AI models, managing enterprise data, and ensuring security often leaves teams stuck in endless proof-of-concept cycles.

That's exactly why we created our Verified Generative AI Partner certification.

Why choose a Verified Elastic AI Partner seller?

Our verified partners can help developers leverage the Elastic AI Ecosystem with their:

  • Deep expertise: Certified partners have proven their deep knowledge and understanding of Elastic AI technologies.

  • Strategic support: They can provide expert guidance and support throughout the implementation and optimization process.

  • Innovative approaches: Verified AI partners are at the forefront of AI innovation and can help you stay ahead of the curve.

  • Proven reliability: They’ve earned our trust with a proven track record of success and commitment to innovation, and we know they’ll earn yours, too.

Elastic Cloud Serverless now generally available on Microsoft Azure

Today, we are excited to announce the general availability of Elastic Cloud Serverless on Microsoft Azure — now available in the EastUS region. Elastic Cloud Serverless provides the fastest way to start and scale security, observability, and search solutions without managing infrastructure. Built on the industry-first Search AI Lake architecture — which relies on Azure Blob Storage — it combines vast storage, decoupled storage and compute, low-latency querying, and advanced AI capabilities to deliver uncompromising speed and scale.

Elastic's journey to build Elastic Cloud Serverless

How do you take a stateful, performance-critical system like Elasticsearch and make it serverless?

At Elastic, we reimagined everything — from storage to orchestration — to build a truly serverless platform that customers can trust.

Elastic Cloud Serverless is a fully managed, cloud-native platform designed to bring the power of Elastic Stack to developers without the operational burden. In this blog post, we will walk you through why we built it, how we approached the architecture, and what we learned along the way.

Optimizing object store efficiency

While the shift to object storage delivered operational and durability benefits, it introduced a new challenge: object store API costs. Writes to Elasticsearch — particularly translog updates and refreshes — translate directly into object store API calls, which can scale up quickly and unpredictably, especially under high-ingestion or high-refresh workloads.

To address this, we implemented a per-node translog buffering mechanism that coalesces writes before flushing to the object store, significantly reducing write amplification. We also decoupled refreshes from object store writes, instead sending refreshed segments directly to search nodes while deferring object store persistence. This architectural refinement reduced refresh-related object store API calls by two orders of magnitude, with no compromise to data durability. For more details, please refer to this blog post.
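The effect of coalescing can be illustrated with a minimal sketch. The class name, thresholds, and object-store client here are hypothetical stand-ins, not Elastic's actual implementation: many small translog writes accumulate in a per-node buffer and reach the object store as a handful of larger PUTs.

```python
import time

class TranslogBuffer:
    """Hypothetical per-node buffer that coalesces translog writes
    before flushing them to an object store as a single PUT."""

    def __init__(self, store, flush_bytes=1_048_576, flush_interval=0.2):
        self.store = store                    # anything with a put(blob) method
        self.flush_bytes = flush_bytes        # size threshold for a flush
        self.flush_interval = flush_interval  # max seconds between flushes
        self.pending = []
        self.pending_size = 0
        self.last_flush = time.monotonic()

    def append(self, op: bytes):
        """Buffer one translog operation; flush if a threshold is hit."""
        self.pending.append(op)
        self.pending_size += len(op)
        if (self.pending_size >= self.flush_bytes
                or time.monotonic() - self.last_flush >= self.flush_interval):
            self.flush()

    def flush(self):
        """One object-store API call for many buffered operations."""
        if self.pending:
            self.store.put(b"".join(self.pending))
            self.pending.clear()
            self.pending_size = 0
        self.last_flush = time.monotonic()

class CountingStore:
    """Fake object store that just counts API calls."""
    def __init__(self):
        self.puts = 0
    def put(self, blob):
        self.puts += 1

store = CountingStore()
buf = TranslogBuffer(store, flush_bytes=1024, flush_interval=60)
for _ in range(1000):
    buf.append(b"x" * 10)   # 1,000 small writes
buf.flush()
print(store.puts)           # coalesced into ~10 object store calls
```

In this toy run, 1,000 appends produce about ten object store calls instead of 1,000. The real system must additionally preserve durability guarantees on flush, which this sketch ignores.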

Managing infrastructure

The Unified layer is the operator-facing management layer, providing Kubernetes CRDs for service owners to manage their Kubernetes clusters. Service owners can define parameters including the CSP, region, and type (explained in the next section). The Unified layer enriches operators' requests and forwards them to the Management layer.

The Management layer acts as a proxy between the Unified layer and CSP APIs, transforming requests from the Unified layer to CSP resource requests and reporting the status back to the Unified layer.

In our current setup, we maintain two management Kubernetes clusters for each CSP within every environment. This dual-cluster approach primarily serves two key purposes. Firstly, it allows us to effectively address potential scalability concerns that may arise with Crossplane. Secondly, and more importantly, it enables us to use one of the clusters as a canary environment. This canary deployment strategy facilitates a phased rollout of our changes, starting with a smaller, controlled subset of each environment, minimizing risk.

The Workload layer contains all the Kubernetes workload clusters running the applications that users interact with (Elasticsearch, Kibana, MIS, etc.).


The Control Plane is the user-facing management layer. We provide UIs and APIs for users to manage their Elastic Cloud Serverless projects. This is where users can create new projects, control who has access to their projects, and get an overview of their projects.

The Data Plane is the infrastructure layer that powers the Elastic Cloud Serverless projects and that users interact with when they want to use their projects.

A fundamental design decision we faced was how the global control plane should communicate with Kubernetes clusters in the data plane. We explored two models:

  • Push Model: The control plane proactively pushes configurations to regional Kubernetes clusters.

  • Pull Model: Regional Kubernetes clusters periodically fetch configurations from the control plane.

After evaluating both approaches, we adopted the Push Model due to its simplicity, unidirectional data flow, and ability to operate Kubernetes clusters independently from the control plane during failures. This model allowed us to maintain straightforward scheduling logic while reducing operational overhead and failure recovery complexities.
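The push model's properties can be sketched in a few lines; the class names and retry scheme below are invented for illustration and are not Elastic's actual code. The control plane owns the pending queue and pushes configurations one way, while each regional cluster retains its last applied state and keeps serving even if pushes stop arriving.

```python
class RegionalCluster:
    """Illustrative regional cluster that applies pushed configurations.
    It keeps the last applied config, so it continues serving even when
    the control plane is unreachable."""
    def __init__(self, region):
        self.region = region
        self.applied = {}   # project_id -> config

    def apply(self, project_id, config):
        self.applied[project_id] = config

class ControlPlane:
    """Illustrative control plane using the push model: it proactively
    pushes each project's desired config to its target region."""
    def __init__(self, clusters):
        self.clusters = {c.region: c for c in clusters}
        self.pending = []   # (project_id, region, config) awaiting push

    def schedule(self, project_id, region, config):
        self.pending.append((project_id, region, config))

    def push_all(self):
        """Unidirectional flow: control plane -> regional clusters.
        Failed pushes stay pending and are retried on the next cycle."""
        still_pending = []
        for project_id, region, config in self.pending:
            try:
                self.clusters[region].apply(project_id, config)
            except Exception:
                still_pending.append((project_id, region, config))
        self.pending = still_pending

east = RegionalCluster("azure-eastus")
cp = ControlPlane([east])
cp.schedule("proj-1", "azure-eastus", {"tier": "search", "size": "small"})
cp.push_all()
```

Because data flows in only one direction, scheduling logic stays in one place, and a regional cluster never has to reach back into the control plane to discover its desired state.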

Elastic Cloud Serverless introduces nuanced autoscaling capabilities tailored for the search tier — leveraging inputs such as boosted data windows, search power settings, and search load metrics (including thread pool load and queue load). These signals work together to define baseline configurations and trigger dynamic scaling decisions based on customer search usage patterns. For a deeper dive into search tier autoscaling, read this blog post. To learn more about how indexing tier autoscaling works, check out this blog post.

This layered, intelligent scaling strategy ensures performance and efficiency across diverse workloads — and it’s a big step toward a truly serverless platform.
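A toy sketch of how such signals might combine into a scaling decision follows. The thresholds, the doubling policy, and the function itself are illustrative assumptions, not Elastic's actual algorithm:

```python
def desired_search_nodes(baseline, thread_pool_load, queue_load,
                         scale_up_at=0.8, scale_down_at=0.3, max_nodes=32):
    """Hypothetical search-tier scaling decision: combine load signals
    into one utilization score, then scale relative to a baseline that
    would be derived from settings like search power and boosted data."""
    utilization = max(thread_pool_load, queue_load)  # worst signal wins
    if utilization >= scale_up_at:
        return min(baseline * 2, max_nodes)   # scale out, capped
    if utilization <= scale_down_at:
        return max(baseline // 2, 1)          # scale in, never below one node
    return baseline                           # within band: hold steady

print(desired_search_nodes(baseline=4, thread_pool_load=0.9, queue_load=0.5))
```

Taking the worst of the signals is one simple way to react to pressure in either the thread pools or the queues; a real implementation would also smooth the signals over time to avoid flapping.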

Billing pipeline

Once usage records are deposited in object storage, the billing pipeline picks up the data and turns it into quantities of ECU (Elastic Consumption Units, our currency-agnostic billing unit) that we bill for. The basic process looks like this:


A transform process consumes the metered usage records from object storage and turns them into records that can actually be billed. This involves:

  • Unit conversion: the metered application may measure storage in bytes, but we may bill in GB.

  • Filtering: usage sources that we don't bill for are dropped.

  • Product mapping: metadata in the usage records is parsed to tie the usage to a solution-specific product that has a unique price.

The resulting data is sent to an Elasticsearch cluster, which our billing engine queries. The purpose of this transform stage is to provide a centralized place for the logic that converts generic metered usage records into product-specific quantities ready to be priced. This keeps the specialized logic out of the metered applications and the billing engine, which we want to remain simple and product-agnostic.
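The transform stage described above can be sketched as follows; the product catalog, field names, and unit constants are hypothetical stand-ins for the real metadata-driven mappings:

```python
BYTES_PER_GB = 1024 ** 3

# Hypothetical product catalog and un-billed sources; the real mappings
# are derived from metadata carried in the usage records.
PRODUCTS = {("search", "storage"): "serverless-search-storage-gb"}
UNBILLED_SOURCES = {"internal-monitoring"}

def transform(metered_records):
    """Sketch of the transform stage: convert units, filter out
    un-billed sources, and map each record to a priceable product."""
    billable = []
    for rec in metered_records:
        if rec["source"] in UNBILLED_SOURCES:
            continue                       # we don't bill for this source
        product = PRODUCTS.get((rec["solution"], rec["kind"]))
        if product is None:
            continue                       # no product mapping -> not billable
        billable.append({
            "product": product,
            "quantity_gb": rec["bytes"] / BYTES_PER_GB,  # bytes -> GB
            "occurred_at": rec["occurred_at"],
        })
    return billable

records = [
    {"source": "project", "solution": "search", "kind": "storage",
     "bytes": 2 * BYTES_PER_GB, "occurred_at": "2025-01-15T00:00:00Z"},
    {"source": "internal-monitoring", "solution": "search", "kind": "storage",
     "bytes": BYTES_PER_GB, "occurred_at": "2025-01-15T00:00:00Z"},
]
out = transform(records)
```

The output records carry only a product identifier, a quantity in billable units, and the occurrence time, which is all the downstream billing engine needs to stay product-agnostic.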

The billing engine then rates these billable usage records, which now contain an identifier that maps to a product in our prices database. At a minimum, this process entails summing the usage over a given period and multiplying the quantity by the product's price to compute the ECUs. In some cases, it must additionally segment the usage into tiers based on cumulative usage throughout the month and map these to individually priced product tiers. In order to tolerate delays in the upstream process without missing records, usage is billed at the time it arrives in the billable usage datastore, but it’s priced according to when it occurred (to ensure we don't apply the wrong price for usage that arrived "late"). This provides a "self-healing" capability to our billing process.
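Tiered rating, the trickiest part of the computation above, can be sketched like this; the tier boundaries and ECU rates are invented for illustration:

```python
def rate_tiered(quantity, tiers):
    """Sketch of tiered rating: segment cumulative monthly usage into
    tiers and price each slice at its tier's ECU rate. `tiers` is a list
    of (upper_bound, ecu_per_unit) pairs in ascending order; the last
    bound may be float('inf')."""
    ecus = 0.0
    lower = 0.0
    for upper, rate in tiers:
        slice_qty = max(0.0, min(quantity, upper) - lower)  # usage in this tier
        ecus += slice_qty * rate
        lower = upper
        if quantity <= upper:
            break
    return ecus

# Hypothetical schedule: first 100 units at 0.5 ECU, the rest at 0.25 ECU.
tiers = [(100, 0.5), (float("inf"), 0.25)]
print(rate_tiered(250, tiers))   # 100 * 0.5 + 150 * 0.25 = 87.5
```

A real rating run would also look up the price schedule as of each record's occurrence time rather than the processing time, which is what gives late-arriving usage the correct price.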

Finally, once the ECUs are computed, we assess any add-on costs (such as for support) and then feed this into the billing calculations, which ultimately result in an invoice (sent by us or one of our cloud marketplace partners). This final part of the process is not new or unique to Serverless and is handled by the same systems that bill our Hosted product.