fix(observability): Aggregate metrics #5174

ktff · 2020-11-22T18:46:29Z

This is the more sensitive, disruptive part, so I would like to address it separately from the larger change of adding the label to all of the internal metrics.

PR aggregates same named and same tagged Counter metrics with different origin in the same component. Where events_processed_total and processed_bytes_total are aggregated:

those metrics that have origin the same as component_type, have priority
max value is chosen among alternatives of the same priority

, and the rest are aggregated by sum.

Open questions

@leebenson does this change conflict with something?
How should we name this label? Currently it's origin.

Signed-off-by: ktf <krunotf@gmail.com>

leebenson · 2020-11-23T14:41:52Z

Apologies @ktff, I'm playing with this locally and got trigger-happy with the 'push' on a master merge that was required to fix the jemalloc issue on macOS.

It shouldn't materially change the PR but it wasn't intentional.

leebenson · 2020-11-23T15:02:55Z

@MOZGIII will likely have more constructive feedback on the use of origin as part of our wider internal metrics strategy, but these changes don't appear to have impacted the metrics reported by vector top / the GraphQL API so superficially, this LGTM 👍

ktff · 2020-11-24T12:58:41Z

@leebenson no problem.

MOZGIII · 2020-11-27T10:21:47Z

It looks like it is too early to thing about a generic algorithm for all of the metrics that we gather - until we have an actual use case that would force us to think and understand how to use the metrics, we simply don't have the basis to root our assumptions on how the metrics should be processed. I hope this would be somewhat resolved after we take care of the grafana dashboard (#4838).

Speaking of events_processed_total and bytes_processed_total, I think that the right logic for these two fields, in particular, would be to only include metrics values where the component_type matches the origin.
Those would be the counters maintained by the sources/transforms/sinks that are executing as top-level components, rather than being included in other components as building blocks. The rest (where component_type does not match the origin) would be things like json_parser transform included into the kubernetes_logs source - for each even originating from the kubernetes_logs source two counters will be bumped - the events_processed_total{component_type="kubernetes_logs", origin="kubernetes_logs"}, and events_processed_total{component_type="json_parser", origin="kubernetes_logs"} (for the wrapped json_parser component). We should ignore the second counter when showing stats to the users.

This problem needs further analysis. Quite possible we should prohibit the use of transforms as parts of other components as-is, and instead extract the core implementations of the reusable transforms into a set of building blocks that both respective transforms and other components can use. Then we'll only implement the emit!s in the transforms level, rather than it being at the building block level.

MOZGIII

I don't think this approach is right. Maybe I don't understand what we're trying to solve here. I've explained my understanding of the problem above.
I'm looking forward to discussing this in more detail if needed.

ktff · 2020-11-27T14:58:56Z

@MOZGIII I'l try to explain how I've imagined origin to be used to see where the issue is.

origin is a label added by hand to internal metrics to their counter! call site.
Currently none has been added.
origin is required only for those components that reuse other components, and only for those metrics that shouldn't be double counted. Currently those are events_processed_total and bytes_processed_total. (EDIT: although requiring origin seams better choice, mentioned later down)
for other cases origin is optional. Most that we can gain from them is extra information if they are being reused in some other components.

This can be achieved by aggregating counters for metrics that are the equal except in origin. For those metrics that shouldn't be double counted we choose one of the values, and the rest we add up. origin label is removed in the process.

It looks like it is too early to thing about a generic algorithm for all of the metrics that we gather

I agree, but we can at least solve the double counting for those two metrics for current case.

to only include metrics values where the component_type matches the origin.

That would be clearer, and we can relax that by only requiring origin for events_processed_total and bytes_processed_total. For other metrics it isn't needed.

For json_parser in kubernetes_logs labels would then be events_processed_total{component_type="kubernetes_logs", origin="json_parser"} and it would be discarded.

we should prohibit the use of transforms as parts of other components as-is,

Why would that be a problem?

MOZGIII · 2020-11-27T16:01:34Z

Got it, what you outlined matches what I have in mind. Great!

This can be achieved by aggregating counters for metrics that are the equal except in origin. For those metrics that shouldn't be double counted we choose one of the values, and the rest we add up. origin label is removed in the process.

Yep, I saw this is what the code does currently, however it's not clear to me how do we pick the right one of all those counters. Looks like currently, it's an arbitrary pick, which doesn't really solve the issue - we might present the wrong values to the user.

With the component_type == origin logic this issue should be resolved.

That would be clearer, and we can relax that by only requiring origin for events_processed_total and bytes_processed_total. For other metrics it isn't needed.

For json_parser in kubernetes_logs labels would then be events_processed_total{component_type="kubernetes_logs", origin="json_parser"} and it would be discarded.

Sounds good!

we should prohibit the use of transforms as parts of other components as-is,

Why would that be a problem?

Precisely because of the duplicating emits. We have emits that we sort of tied to the component level entities (events_processed_total, bytes_processed_total, etc) - the problematic ones, and the "implementation-specific" metrics that are not common among a higher architectural layer - those I think can safely be emitted when wrapped with other components. We just don't have a clear separation between the transforms and the code intended for reuse - and this is a bit problematic in the context of metrics.

As a side note, I realized there's no json_parser reuse in the kubernetes_logs source anymore (but there is still, to the very least, a regex parser reuse, so the problem remains).

ktff · 2020-11-27T17:03:52Z

we should prohibit the use of transforms as parts of other components as-is,

Why would that be a problem?

Precisely because of the duplicating emits.

Then we are not just talking of prohibiting use of transforms, but also of sinks and sources. For extreme examples there are new_relic_logs and honeycomb sinks which are just wrappers of http sink. And this reminded me why I've opted to not require component_type == origin, since we would need to change http and sinks like those to expose event to higher components so they could do the emit.

That's why for scenarios of where there are no metrics with component_type == origin and multiple of them, we select max value. It is arbitrary in a sense that it assumes no events are being generated between multiple emit! points, if that holds then all of the counters would be the same or some would we smaller if some filtering happened so max value is probably what we want, but yea, I realized now that it's more arbitrary than I thought.

Talking about this gave me an idea. We can handle metrics events_processed_total and bytes_processed_total in a different way if we uphold and change few things.

A source/sink can wrap only one source/sink respectively. Wrapped source/sink can wrap other source/sink.
For transforms we raise emitting of events_processed_total to topology level.
If source/sink/shared code exposes a way to encode/decode event then it's the responsibility of wrapper to emit those metrics. Otherwise it's their responsibility. Or said in other way, it's the responsibility of that which makes bytes -> event or event -> bytes transition to emit those events.

Second is a simple change, and for first I don't know of any one source/sink that uses more than one, and if such case do arise we can deal with it later. The third point is the problematic one since it has fuzzy borders.

This would achieve that only one events_processed_total and bytes_processed_total emit sites exists in sources and sinks, but also in transforms since they aren't emitting events_processed_total, and there are no transforms that emits bytes_processed_total.

No origin label or messing with metrics is needed, although it's somewhat all over the place.

@MOZGIII what do you think?

ktff · 2020-12-03T21:35:13Z

Ok, so I think the current approach in the PR is solid with its' main problem being the choice of counter when we don't have component_type == origin. For reason that some components don't have access to events so they can't emit their own metrics, we can't require component_type == origin, but if we have only one choice of counter that isn't an issue. So in the current PR the problematic point is this line:
https://github.com/timberio/vector/blob/6a78450f14008dd02aaab8ece773c7f2d97e02a9/src/api/schema/metrics/mod.rs#L294-L295
when we have multiple choices with component_type == origin metrics or with component_type != origin metrics. For that we can ban it, by reporting an error that this should be fixed, additionally we can choose a sensible value until it's fixed, for that I think max is fine.

Plus, to ease the problem we can raise emitting of events_processed_total out of transform implementations to topological level. With that we only need to worry about metrics in sources that wrap other sources, and sinks that wrap other sinks.

@MOZGIII what do you think? Do you see some additional issues?

MOZGIII · 2020-12-03T22:10:16Z

Plus, to ease the problem we can raise emitting of events_processed_total out of transform implementations to topological level. With that we only need to worry about metrics in sources that wrap other sources, and sinks that wrap other sinks.

I think this was explored already and there were issues with this approach. CC @lukesteensen

Ok, so I think the current approach in the PR is solid with its' main problem being the choice of counter when we don't have component_type == origin. For reason that some components don't have access to events so they can't emit their own metrics, we can't require component_type == origin, but if we have only one choice of counter that isn't an issue. So in the current PR the problematic point is this line:
https://github.com/timberio/vector/blob/6a78450f14008dd02aaab8ece773c7f2d97e02a9/src/api/schema/metrics/mod.rs#L294-L295

when we have multiple choices with component_type == origin metrics or with component_type != origin metrics. For that we can ban it, by reporting an error that this should be fixed, additionally we can choose a sensible value until it's fixed, for that I think max is fine.

Yeeah, for the lack of a better alternative - I guess this is fine for now.

We should probably just wrap the "nested components" invocation with an extra span, adding a new label (for instance wrapped = true). With this we would be able to reliably filter out all wrapped = true labels. Implementing this will, however, require a manual audit to figure out where do we still have nested components - and wrapping the invocations with spans (as in tracing::span!).

ktff · 2020-12-03T22:15:50Z

I think this was explored already and there were issues with this approach. CC @lukesteensen

I guess the inline transforms will be an issue.

Yeeah, for the lack of a better alternative - I guess this is fine for now.

Yep, that's also my stance on this.

We should probably just wrap the "nested components" invocation with an extra span, adding a new label (for instance wrapped = true)

I'll explore if we can somehow automate this.

ktff · 2020-12-03T22:36:44Z

We should probably just wrap the "nested components" invocation with an extra span, adding a new label (for instance wrapped = true)

I'll explore if we can somehow automate this.

It can at least be made pretty ergonomic, so if entering/exiting span for every event for every reused component isn't too costly this seams like the best option. Although I guess inline transforms will have this same problem/cost of spans so it won't be nothing new.

@MOZGIII I agree, if you are for it we can try this approach.

MOZGIII · 2020-12-03T22:46:29Z

if entering/exiting span for every event for every reused component isn't too costly

This I'm a bit worried about, and it seems like it might be relatively costly based on our other observations. We can try and compare the benches before/after.

Other than that, I think this is the only correct solution on the table so far - so I am for it.

ktff · 2020-12-06T22:07:01Z

Closing in favor of #5424

ktff added 3 commits November 22, 2020 18:12

Aggregate

9f72742

Signed-off-by: ktf <krunotf@gmail.com>

Add tests

43ce4e0

Signed-off-by: ktf <krunotf@gmail.com>

Update naming

e6aa32f

Signed-off-by: ktf <krunotf@gmail.com>

ktff added the domain: observability Anything related to monitoring/observing Vector label Nov 22, 2020

ktff requested a review from leebenson November 22, 2020 18:46

ktff self-assigned this Nov 22, 2020

Merge branch 'master' into ktff/origin_label

6a78450

ktff requested a review from MOZGIII November 24, 2020 12:57

MOZGIII suggested changes Nov 27, 2020

View reviewed changes

jamtur01 added this to the 2020-12-07: Nanite Repair System milestone Dec 3, 2020

ktff mentioned this pull request Dec 6, 2020

fix(observability): Emit wrapped metrics for reused components #5424

Closed

3 tasks

ktff closed this Dec 6, 2020

binarylogic deleted the ktff/origin_label branch April 23, 2021 14:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(observability): Aggregate metrics #5174

fix(observability): Aggregate metrics #5174

ktff commented Nov 22, 2020 •

edited

Loading

leebenson commented Nov 23, 2020

leebenson commented Nov 23, 2020

ktff commented Nov 24, 2020

MOZGIII commented Nov 27, 2020

MOZGIII left a comment •

edited

Loading

ktff commented Nov 27, 2020 •

edited

Loading

MOZGIII commented Nov 27, 2020 •

edited

Loading

ktff commented Nov 27, 2020 •

edited

Loading

ktff commented Dec 3, 2020 •

edited

Loading

MOZGIII commented Dec 3, 2020

ktff commented Dec 3, 2020 •

edited

Loading

ktff commented Dec 3, 2020

MOZGIII commented Dec 3, 2020 •

edited

Loading

ktff commented Dec 6, 2020

fix(observability): Aggregate metrics #5174

fix(observability): Aggregate metrics #5174

Conversation

ktff commented Nov 22, 2020 • edited Loading

Open questions

leebenson commented Nov 23, 2020

leebenson commented Nov 23, 2020

ktff commented Nov 24, 2020

MOZGIII commented Nov 27, 2020

MOZGIII left a comment • edited Loading

Choose a reason for hiding this comment

ktff commented Nov 27, 2020 • edited Loading

MOZGIII commented Nov 27, 2020 • edited Loading

ktff commented Nov 27, 2020 • edited Loading

ktff commented Dec 3, 2020 • edited Loading

MOZGIII commented Dec 3, 2020

ktff commented Dec 3, 2020 • edited Loading

ktff commented Dec 3, 2020

MOZGIII commented Dec 3, 2020 • edited Loading

ktff commented Dec 6, 2020

ktff commented Nov 22, 2020 •

edited

Loading

MOZGIII left a comment •

edited

Loading

ktff commented Nov 27, 2020 •

edited

Loading

MOZGIII commented Nov 27, 2020 •

edited

Loading

ktff commented Nov 27, 2020 •

edited

Loading

ktff commented Dec 3, 2020 •

edited

Loading

ktff commented Dec 3, 2020 •

edited

Loading

MOZGIII commented Dec 3, 2020 •

edited

Loading