Feature Request: Add Monitoring for Azure AI Services (Foundry, AI Hubs, AI Search, OpenAI) #2655

@jkutes

Description

Proposal

As organizations increasingly adopt Azure's AI capabilities, comprehensive monitoring of these services becomes critical for operational stability, performance management, and cost optimization. Currently, Promitor lacks direct support for scraping metrics from several key Azure AI resource types, forcing users to rely on less integrated monitoring solutions or manual Azure Portal checks. This creates monitoring gaps and increases operational overhead for teams leveraging these services.

I would like to request the addition of support for scraping Azure Monitor metrics for the following Azure AI resource types within Promitor:

  • Azure AI Foundry

Resource Type: (To be determined by Promitor team, e.g., AzureAIFoundry)

Key Metrics (examples): Usage, Throughput, Latency, Error Rates specific to Foundry components.

  • Azure AI Hubs

Resource Type: (To be determined by Promitor team, e.g., AzureAIHubs)

Key Metrics (examples): Connection counts, Message throughput, Processing latency, Error rates for hub operations.

  • Azure AI Search (formerly Azure Cognitive Search)

Resource Type: CognitiveSearch (as identified in Promitor's existing schema)

Key Metrics (examples):

    • SearchLatency (average query latency)

    • ThrottledQueries (count of throttled queries)

    • QueryErrors (count of query errors)

    • SearchQueriesPerSecond (queries per second)

    • DocumentCount (total documents)

    • StorageUsage (storage consumed)

    • SkillExecutionCount (number of skill executions)

  • Azure OpenAI Service

Resource Type: (To be determined by Promitor team, e.g., AzureOpenAI)

Key Metrics (examples): Token usage (input/output), Request latency, Throughput (requests/sec), Error rates (e.g., rate limit errors, internal errors), Model usage.
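For illustration, here is a rough sketch of what a metric declaration for one of these services could look like, modeled on Promitor's existing metrics-declaration schema. This assumes the existing CognitiveSearch resource type and its serviceName resource field; the metric name azure_search_latency, the Azure subscription details, and the service name are all placeholders:

```yaml
version: v1
azureMetadata:
  tenantId: <tenant-id>             # placeholder
  subscriptionId: <subscription-id> # placeholder
  resourceGroupName: <resource-group>
metrics:
  - name: azure_search_latency
    description: "Average query latency for Azure AI Search"
    resourceType: CognitiveSearch   # existing Promitor resource type
    azureMetricConfiguration:
      metricName: SearchLatency     # Azure Monitor metric on the search service
      aggregation:
        type: Average
    resources:
      - serviceName: my-search-service # placeholder
```

Declarations for the other AI services would presumably follow the same shape once resource types (e.g. AzureOpenAI) are defined, with only resourceType, metricName, and the resource identifier fields differing.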

Integrating these Azure AI services into Promitor would significantly enhance the observability capabilities for teams building AI-powered applications on Azure. It would allow for:

  • Unified monitoring alongside other Azure resources in Prometheus/Grafana.

  • Proactive alerting on performance degradation, capacity issues, and errors.

  • Simplified operational management of critical AI infrastructure.
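As an example of the alerting benefit: once Promitor exposes these metrics, a standard Prometheus alerting rule can fire on them. The metric name below (azure_search_throttled_queries) is hypothetical and depends on what the user names the metric in their Promitor declaration:

```yaml
groups:
  - name: azure-ai-search
    rules:
      - alert: SearchQueriesThrottled
        # Metric name is whatever was declared in Promitor's config (hypothetical here)
        expr: azure_search_throttled_queries > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Azure AI Search is throttling queries"
```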

Thank you for considering this feature request.

Component

Scraper

Contact Details

juraj@hyperproof.io

Metadata

Status

Proposed

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests