Amazon Bedrock now supports observability of First Token Latency and Quota Consumption

Amazon Bedrock is a fully managed service for building generative AI applications using high-performing foundation models from leading AI providers. It now supports two new CloudWatch metrics, TimeToFirstToken and EstimatedTPMQuotaUsage, giving you deeper visibility into inference performance and quota consumption.

TimeToFirstToken measures the latency from when a request is sent to when the first token is received, for the streaming APIs (ConverseStream and InvokeModelWithResponseStream). You can use this metric to set CloudWatch alarms that monitor latency degradation and to establish SLA baselines, without any client-side instrumentation. EstimatedTPMQuotaUsage tracks your estimated Tokens Per Minute (TPM) quota consumption, including cache write tokens and output burndown multipliers, across all inference APIs (Converse, InvokeModel, ConverseStream, and InvokeModelWithResponseStream). You can use this metric to set proactive alarms before you reach your quota limit, track quota consumption across your models, and request quota increases before usage is rate limited.
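As a sketch of the quota-alarm use case above, the snippet below builds the parameters for a CloudWatch alarm that fires when EstimatedTPMQuotaUsage crosses 80% of a hypothetical 200,000 TPM quota. The `AWS/Bedrock` namespace and `ModelId` dimension are assumptions based on Bedrock's existing runtime metrics, and the model ID and quota value are illustrative; verify all of them against your account before use.

```python
# Sketch: keyword arguments for cloudwatch.put_metric_alarm() that alert
# when estimated TPM usage exceeds 80% of a (hypothetical) 200,000 TPM quota.
# Namespace "AWS/Bedrock" and the "ModelId" dimension are assumptions
# mirroring other Bedrock runtime metrics.

def tpm_alarm_params(model_id: str, tpm_quota: int, threshold_pct: float = 0.8) -> dict:
    """Build the keyword arguments for cloudwatch.put_metric_alarm()."""
    return {
        "AlarmName": f"bedrock-tpm-{model_id}",
        "Namespace": "AWS/Bedrock",
        "MetricName": "EstimatedTPMQuotaUsage",
        "Dimensions": [{"Name": "ModelId", "Value": model_id}],
        "Statistic": "Maximum",
        "Period": 60,                      # the metric updates every minute
        "EvaluationPeriods": 3,            # require three consecutive breaches
        "Threshold": tpm_quota * threshold_pct,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }

params = tpm_alarm_params("anthropic.claude-3-5-sonnet-20240620-v1:0", 200_000)
# boto3.client("cloudwatch").put_metric_alarm(**params)  # requires AWS credentials
```

Treating missing data as `notBreaching` keeps the alarm quiet during idle periods, since the metric is emitted only for completed requests.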

Both metrics are supported in all commercial Bedrock Regions for models available via cross-region inference profiles and in-region inference, and are updated every minute for successfully completed requests. They appear in CloudWatch out of the box, with no API changes or opt-in required; you pay only for the underlying model inference you consume.
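Because the metrics land in CloudWatch automatically, you can pull them back with the standard CloudWatch APIs. The sketch below builds parameters for `get_metric_statistics` to fetch per-minute p90 time-to-first-token for one model over the last hour; as above, the `AWS/Bedrock` namespace, `ModelId` dimension, and model ID are assumptions to check against your account.

```python
# Sketch: keyword arguments for cloudwatch.get_metric_statistics() to pull
# per-minute p90 TimeToFirstToken for a single model over a recent window.
# Namespace "AWS/Bedrock" and the "ModelId" dimension are assumptions
# mirroring other Bedrock runtime metrics.
from datetime import datetime, timedelta, timezone

def ttft_query_params(model_id: str, hours: int = 1) -> dict:
    """Build the keyword arguments for cloudwatch.get_metric_statistics()."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Bedrock",
        "MetricName": "TimeToFirstToken",
        "Dimensions": [{"Name": "ModelId", "Value": model_id}],
        "StartTime": now - timedelta(hours=hours),
        "EndTime": now,
        "Period": 60,                        # one datapoint per minute
        "ExtendedStatistics": ["p90"],       # percentile rather than average
    }

query = ttft_query_params("anthropic.claude-3-5-sonnet-20240620-v1:0")
# boto3.client("cloudwatch").get_metric_statistics(**query)  # requires AWS credentials
```

Querying a percentile such as p90 instead of the average makes tail-latency degradation visible, which is usually what an SLA baseline cares about.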

To learn more about TimeToFirstToken and EstimatedTPMQuotaUsage, see our documentation page on Monitoring Amazon Bedrock.


Source: Amazon Web Services


