Monitoring Resource Health and Performance in OCI

Monitoring is not just about dashboards—it’s about creating a predictable, stable, and visible operating environment. In Oracle Cloud Infrastructure (OCI), this means not only watching performance metrics but also building automated, proactive alerting mechanisms across compute, network, storage, and databases.

This post covers the tools OCI provides for resource monitoring, how to set up proactive alerts, and how to design a robust operational monitoring strategy.


Why OCI Monitoring Matters

Modern OCI deployments are dynamic and distributed: virtual machines, block volumes, load balancers, object storage, autonomous databases, and more. Each layer needs visibility into its performance and availability.

Without monitoring:

  • You react after a failure occurs
  • You lack trend data to optimize capacity
  • You miss early warning signs (like CPU spikes, network errors, or high IOPS)

With monitoring:

  • You detect issues early, before impact
  • You improve MTTR (Mean Time To Resolve)
  • You support capacity planning and scaling
  • You ensure compliance and reporting

OCI Monitoring Stack Overview

Oracle provides a full suite of tools under the OCI Observability & Management umbrella:

| Feature               | Description                                                          |
|-----------------------|----------------------------------------------------------------------|
| Monitoring            | Collects metrics from all OCI services (Compute, DB, Network, etc.)  |
| Logging               | Captures log events from services and custom apps                    |
| Alarms                | Define thresholds for metric values and trigger alerts               |
| Notifications         | Sends email, Slack, PagerDuty, or HTTPS messages when alarms fire    |
| Service Connector Hub | Streams logs and metrics between services for automation             |
| Resource Health       | Monitors the lifecycle and operational status of OCI services        |


Step-by-Step: Monitoring & Alerts in OCI

1. Enable Monitoring at the Resource Level

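Most services emit metrics by default (Compute instance metrics come from the monitoring plugin of the Oracle Cloud Agent running on the instance). Examples include: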

  • Compute (CPU, memory, disk)
  • Load Balancer (backend health, latency)
  • Autonomous DB (CPU, sessions, storage)
  • Object Storage (read/write ops, errors)

You can query these via OCI Console, CLI, SDK, or Monitoring API.
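
As a minimal sketch of the SDK route, the snippet below builds a Monitoring Query Language (MQL) expression and passes it to the OCI Python SDK. The helper names `build_mql` and `fetch_datapoints` are hypothetical, and the sketch assumes the `oci` package is installed and a profile exists in `~/.oci/config`; the SDK import is deferred so the query builder can be used on its own.

```python
from datetime import datetime, timedelta, timezone


def build_mql(metric, interval, statistic, dimensions=None):
    """Build an MQL expression: metric[interval]{dim = "value"}.statistic()."""
    dims = ""
    if dimensions:
        pairs = ", ".join(f'{key} = "{value}"' for key, value in dimensions.items())
        dims = "{" + pairs + "}"
    return f"{metric}[{interval}]{dims}.{statistic}()"


def fetch_datapoints(compartment_id, query, hours=1, namespace="oci_computeagent"):
    """Return (timestamp, value) pairs for the last `hours` hours.

    Assumes `pip install oci` and a default profile in ~/.oci/config.
    """
    import oci  # imported lazily so build_mql works without the SDK

    client = oci.monitoring.MonitoringClient(oci.config.from_file())
    end = datetime.now(timezone.utc)
    details = oci.monitoring.models.SummarizeMetricsDataDetails(
        namespace=namespace,
        query=query,
        start_time=end - timedelta(hours=hours),
        end_time=end,
    )
    series = client.summarize_metrics_data(compartment_id, details).data
    return [(p.timestamp, p.value) for s in series for p in s.aggregated_datapoints]


if __name__ == "__main__":
    print(build_mql("CpuUtilization", "1m", "mean"))
```

Keeping the query string separate from the API call makes it easy to reuse the same expression in Metrics Explorer and in an alarm definition.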


2. Use Metrics Explorer for Real-Time Analysis

  • Navigate to Monitoring > Metrics Explorer
  • Select namespace (e.g., oci_computeagent)
  • Choose metric (e.g., CpuUtilization, MemoryUtilization)
  • Apply filters (resource OCID, compartment)
  • Visualize trends in custom graphs

Use this to baseline normal behavior and identify patterns before setting alerts.
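
One way to turn a baseline into a starting threshold is sketched below. It takes metric values you have exported (for example, a week of CpuUtilization datapoints), computes the mean and a nearest-rank 95th percentile, and suggests a threshold with some headroom. The function name `baseline` and the 20% headroom are illustrative assumptions, not an OCI recommendation.

```python
from statistics import fmean


def baseline(values, headroom=1.2):
    """Summarize normal behavior and suggest an alarm threshold (in percent)."""
    if not values:
        raise ValueError("need at least one datapoint")
    ordered = sorted(values)
    # Nearest-rank 95th percentile, using integer math to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    p95 = ordered[rank - 1]
    return {
        "mean": fmean(ordered),
        "p95": p95,
        # Cap at 100 because utilization metrics are percentages.
        "suggested_threshold": min(100.0, round(p95 * headroom, 1)),
    }


if __name__ == "__main__":
    print(baseline([10, 20, 30, 40, 50]))
```

A percentile is used rather than the maximum so that a single outlier does not push the threshold out of reach.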


3. Create Alarms for Proactive Detection

You can create alarms that:

  • Monitor conditions (e.g., CPU > 80% for 5 mins)
  • Send notifications
  • Trigger functions or automation scripts

Example Alarm:

Query: CpuUtilization[1m]{resourceId = "ocid1.instance…"}.mean() > 85

Severity: Critical

Destination: Email or PagerDuty via Notifications Service

An alarm moves between OK and FIRING states. By default it notifies on each state change; enable repeat notifications if you want it to re-send while the condition persists.
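
To make a condition such as "CPU > 80% for 5 mins" concrete, here is a small local simulation of how a sustained breach turns into state changes. It is a simplified model for reasoning about thresholds and trigger delays, not OCI's actual evaluation engine; `evaluate_alarm` and its parameters are hypothetical names, and the input is assumed to be one datapoint per minute.

```python
def evaluate_alarm(datapoints, threshold, pending_minutes):
    """Return a list of (minute_index, new_state) transitions.

    The alarm fires only after `pending_minutes` consecutive datapoints
    exceed `threshold`, and returns to OK on the first datapoint that does not.
    """
    state = "OK"
    consecutive = 0
    transitions = []
    for minute, value in enumerate(datapoints):
        consecutive = consecutive + 1 if value > threshold else 0
        new_state = "FIRING" if consecutive >= pending_minutes else "OK"
        if new_state != state:
            transitions.append((minute, new_state))
            state = new_state
    return transitions


if __name__ == "__main__":
    cpu = [50, 90, 90, 60, 90, 90, 90, 90, 70]
    print(evaluate_alarm(cpu, threshold=85, pending_minutes=3))
```

In this sample the two-minute spike at the start never fires because it does not last three minutes, which is the kind of noise a trigger delay is meant to filter out.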


4. Set Up Notification Destinations

OCI Notifications support:

  • Email
  • Slack
  • PagerDuty
  • Oracle Functions (for auto-scaling, tagging, shutdown)
  • Custom Webhooks

Create a Notifications topic for each destination group, then add a subscription for every user or automation target that should receive its messages.
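
A simple routing table helps decide which topic an alarm should publish to. The sketch below maps alarm severity to topic names; the topic names (`pagerduty-oncall`, `slack-ops`, `email-digest`) and the rule that non-production environments never page are assumptions made for illustration.

```python
# Hypothetical topic names, keyed by OCI alarm severity.
ROUTES = {
    "CRITICAL": ["pagerduty-oncall", "slack-ops"],
    "ERROR": ["slack-ops"],
    "WARNING": ["slack-ops"],
    "INFO": ["email-digest"],
}


def destinations_for(severity, environment="prod"):
    """Return the topics an alarm of this severity should publish to."""
    targets = list(ROUTES.get(severity.upper(), ["email-digest"]))
    if environment.lower() != "prod":
        # Assumption: only production incidents should page the on-call engineer.
        targets = [t for t in targets if not t.startswith("pagerduty")]
    return targets


if __name__ == "__main__":
    print(destinations_for("Critical"))
    print(destinations_for("Critical", environment="dev"))
```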


5. Leverage Resource Health for Status Checks

This is often overlooked. OCI Resource Health tells you:

  • Whether a compute instance is rebooting
  • If a block volume is degraded
  • If an autonomous DB has scheduled maintenance

You can query this via the Console or the OCI CLI (verify the exact command group and flags against the CLI reference for your version):

oci health service resource-health get-instance-health-summary --instance-id <OCID>


6. Use Logging for Deeper Forensics

Combine metrics with logs for root cause analysis:

  • OS logs from Compute (via logging agent)
  • Database logs (Autonomous DB activity logs)
  • API Audit logs
  • Custom app logs

Organize logs into Log Groups, set retention periods, and forward logs to Object Storage or a SIEM.
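
When correlating metrics with logs, a common first step is to narrow the logs to a window around the moment the alarm fired. The sketch below does this for log entries that have already been exported and parsed into dictionaries; the `time` and `message` keys and the window sizes are assumptions about your export format, not an OCI schema.

```python
from datetime import datetime, timedelta


def logs_near(entries, fired_at, before=timedelta(minutes=10), after=timedelta(minutes=2)):
    """Return entries whose timestamp falls in [fired_at - before, fired_at + after]."""
    start, end = fired_at - before, fired_at + after
    return [entry for entry in entries if start <= entry["time"] <= end]


if __name__ == "__main__":
    fired = datetime(2024, 1, 5, 14, 0)
    sample = [
        {"time": datetime(2024, 1, 5, 13, 30), "message": "batch job queued"},
        {"time": datetime(2024, 1, 5, 13, 55), "message": "full table scan started"},
        {"time": datetime(2024, 1, 5, 14, 1), "message": "connection pool exhausted"},
    ]
    for entry in logs_near(sample, fired):
        print(entry["time"], entry["message"])
```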


7. Automate with Service Connector Hub

You can build flows like:

  • If an alarm fires → forward log data → call Function → tag instance or shutdown
  • Stream logs to OCI Logging Analytics, Splunk, or Elastic

This makes your monitoring event-driven and autonomous.
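
The decision logic inside such a Function can be kept separate from the OCI calls so it is easy to test. The sketch below parses an alarm notification and plans an action per affected resource. The JSON field names used here (`type`, `severity`, `alarmMetaData`, `dimensions`, `resourceId`) are an assumption about the alarm message format and should be checked against a real message from your tenancy; `plan_actions` is a hypothetical helper, and the stop-on-critical rule is only an example policy.

```python
import json


def plan_actions(raw_message):
    """Return a list of (action, resource_id) pairs for a firing alarm."""
    message = json.loads(raw_message)
    # Only act when the alarm has just started firing.
    if message.get("type") != "OK_TO_FIRING":
        return []
    action = "stop" if message.get("severity") == "CRITICAL" else "tag"
    actions = []
    for meta in message.get("alarmMetaData", []):
        for dims in meta.get("dimensions", []):
            resource_id = dims.get("resourceId")
            if resource_id:
                actions.append((action, resource_id))
    return actions


if __name__ == "__main__":
    sample = json.dumps({
        "type": "OK_TO_FIRING",
        "severity": "WARNING",
        "alarmMetaData": [{"dimensions": [{"resourceId": "ocid1.instance.example"}]}],
    })
    print(plan_actions(sample))
```

The Function handler would then carry out each planned action with the appropriate SDK client, which keeps the risky part (stopping or tagging instances) in one clearly reviewed place.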


Best Practices for Proactive Monitoring

  • Always monitor CPU, memory, disk, and network I/O
  • Track backend health and latency on Load Balancers
  • For Autonomous DBs, monitor session counts, CPU, storage space
  • Use Alarm suppression for maintenance windows
  • Apply naming conventions and tags to filter by environment (e.g., Prod, Dev)
  • Review Audit logs regularly, and apply Object Storage lifecycle policies to archived logs for cost control
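
The idea behind suppression during maintenance windows can be sketched locally as follows. This is a conceptual model, not the OCI alarm suppression feature itself: the function names and the shape of the `events` dictionaries are hypothetical.

```python
from datetime import datetime


def is_suppressed(fired_at, windows):
    """True when fired_at falls inside any [start, end) maintenance window."""
    return any(start <= fired_at < end for start, end in windows)


def filter_notifications(events, windows):
    """Drop alarm events that fired during a maintenance window."""
    return [event for event in events if not is_suppressed(event["fired_at"], windows)]


if __name__ == "__main__":
    maintenance = [(datetime(2024, 1, 6, 1, 0), datetime(2024, 1, 6, 3, 0))]
    events = [
        {"alarm": "cpu-high", "fired_at": datetime(2024, 1, 6, 2, 15)},
        {"alarm": "cpu-high", "fired_at": datetime(2024, 1, 6, 9, 30)},
    ]
    print(filter_notifications(events, maintenance))
```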

Example Monitoring Use Case: Weekly CPU Surge

You notice CPU spikes every Friday due to batch jobs. With alarms in place:

  • You get notified via Slack before users report issues
  • Logs show query plan issues
  • You adjust job scheduling or indexing strategy

This kind of proactive detection avoids business impact.
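
A recurring pattern like this can also be spotted directly from exported datapoints. The sketch below groups (timestamp, value) pairs by weekday and flags days whose average is well above the overall average; the function names and the 1.5x factor are illustrative assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import fmean

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def weekday_profile(points):
    """Average value per weekday for a list of (datetime, value) pairs."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        buckets[WEEKDAYS[timestamp.weekday()]].append(value)
    return {day: fmean(values) for day, values in buckets.items()}


def surge_days(points, factor=1.5):
    """Weekdays whose average exceeds `factor` times the overall average."""
    overall = fmean([value for _, value in points])
    profile = weekday_profile(points)
    return sorted(day for day, mean in profile.items() if mean > factor * overall)


if __name__ == "__main__":
    start = datetime(2024, 1, 1)  # a Monday
    points = []
    for offset in range(14):
        day = start + timedelta(days=offset)
        points.append((day, 90.0 if day.weekday() == 4 else 30.0))
    print(surge_days(points))
```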


OCI offers powerful native monitoring tools, and by combining Monitoring, Alarms, Logging, and Notifications, you can create a robust observability strategy.

Whether you’re managing Oracle EBS, ADW, OAC, or container workloads, monitoring must be treated as a first-class citizen in your OCI architecture. The key is not just capturing data, but acting on it quickly—with the right alerts, routed to the right teams.


Brijesh Gogia