Skip to content

fix(observability): Aggregate metrics #5174

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from
Closed

fix(observability): Aggregate metrics #5174

wants to merge 4 commits into from

Conversation

ktff
Copy link
Contributor

@ktff ktff commented Nov 22, 2020

First part of solution for #4454

This is the more sensitive, disruptive part, so I would like to address it separately from the larger change of adding the label to all of the internal metrics.

PR aggregates same named and same tagged Counter metrics with different origin in the same component. Where events_processed_total and processed_bytes_total are aggregated:

  • those metrics that have origin the same as component_type, have priority
  • max value is chosen among alternatives of the same priority

, and the rest are aggregated by sum.

Open questions

  • @leebenson does this change conflict with something?
  • How should we name this label? Currently it's origin.

ktff added 3 commits November 22, 2020 18:12
Signed-off-by: ktf <krunotf@gmail.com>
Signed-off-by: ktf <krunotf@gmail.com>
Signed-off-by: ktf <krunotf@gmail.com>
@ktff ktff added the domain: observability Anything related to monitoring/observing Vector label Nov 22, 2020
@ktff ktff requested a review from leebenson November 22, 2020 18:46
@ktff ktff self-assigned this Nov 22, 2020
@leebenson
Copy link
Contributor

Apologies @ktff, I'm playing with this locally and got trigger-happy with the 'push' on a master merge that was required to fix the jemalloc issue on macOS.

It shouldn't materially change the PR but it wasn't intentional.

@leebenson
Copy link
Contributor

@MOZGIII will likely have more constructive feedback on the use of origin as part of our wider internal metrics strategy, but these changes don't appear to have impacted the metrics reported by vector top / the GraphQL API so superficially, this LGTM 👍

@ktff ktff requested a review from MOZGIII November 24, 2020 12:57
@ktff
Copy link
Contributor Author

ktff commented Nov 24, 2020

@leebenson no problem.

@MOZGIII
Copy link
Contributor

MOZGIII commented Nov 27, 2020

It looks like it is too early to thing about a generic algorithm for all of the metrics that we gather - until we have an actual use case that would force us to think and understand how to use the metrics, we simply don't have the basis to root our assumptions on how the metrics should be processed. I hope this would be somewhat resolved after we take care of the grafana dashboard (#4838).

Speaking of events_processed_total and bytes_processed_total, I think that the right logic for these two fields, in particular, would be to only include metrics values where the component_type matches the origin.
Those would be the counters maintained by the sources/transforms/sinks that are executing as top-level components, rather than being included in other components as building blocks. The rest (where component_type does not match the origin) would be things like json_parser transform included into the kubernetes_logs source - for each even originating from the kubernetes_logs source two counters will be bumped - the events_processed_total{component_type="kubernetes_logs", origin="kubernetes_logs"}, and events_processed_total{component_type="json_parser", origin="kubernetes_logs"} (for the wrapped json_parser component). We should ignore the second counter when showing stats to the users.

This problem needs further analysis. Quite possible we should prohibit the use of transforms as parts of other components as-is, and instead extract the core implementations of the reusable transforms into a set of building blocks that both respective transforms and other components can use. Then we'll only implement the emit!s in the transforms level, rather than it being at the building block level.

Copy link
Contributor

@MOZGIII MOZGIII left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this approach is right. Maybe I don't understand what we're trying to solve here. I've explained my understanding of the problem above.
I'm looking forward to discussing this in more detail if needed.

@ktff
Copy link
Contributor Author

ktff commented Nov 27, 2020

@MOZGIII I'l try to explain how I've imagined origin to be used to see where the issue is.

  • origin is a label added by hand to internal metrics to their counter! call site.
  • Currently none has been added.
  • origin is required only for those components that reuse other components, and only for those metrics that shouldn't be double counted. Currently those are events_processed_total and bytes_processed_total. (EDIT: although requiring origin seams better choice, mentioned later down)
  • for other cases origin is optional. Most that we can gain from them is extra information if they are being reused in some other components.

This can be achieved by aggregating counters for metrics that are the equal except in origin. For those metrics that shouldn't be double counted we choose one of the values, and the rest we add up. origin label is removed in the process.


It looks like it is too early to thing about a generic algorithm for all of the metrics that we gather

I agree, but we can at least solve the double counting for those two metrics for current case.

to only include metrics values where the component_type matches the origin.

That would be clearer, and we can relax that by only requiring origin for events_processed_total and bytes_processed_total. For other metrics it isn't needed.

For json_parser in kubernetes_logs labels would then be events_processed_total{component_type="kubernetes_logs", origin="json_parser"} and it would be discarded.

we should prohibit the use of transforms as parts of other components as-is,

Why would that be a problem?

@MOZGIII
Copy link
Contributor

MOZGIII commented Nov 27, 2020

Got it, what you outlined matches what I have in mind. Great!

This can be achieved by aggregating counters for metrics that are the equal except in origin. For those metrics that shouldn't be double counted we choose one of the values, and the rest we add up. origin label is removed in the process.

Yep, I saw this is what the code does currently, however it's not clear to me how do we pick the right one of all those counters. Looks like currently, it's an arbitrary pick, which doesn't really solve the issue - we might present the wrong values to the user.

With the component_type == origin logic this issue should be resolved.

That would be clearer, and we can relax that by only requiring origin for events_processed_total and bytes_processed_total. For other metrics it isn't needed.

For json_parser in kubernetes_logs labels would then be events_processed_total{component_type="kubernetes_logs", origin="json_parser"} and it would be discarded.

Sounds good!

we should prohibit the use of transforms as parts of other components as-is,

Why would that be a problem?

Precisely because of the duplicating emits. We have emits that we sort of tied to the component level entities (events_processed_total, bytes_processed_total, etc) - the problematic ones, and the "implementation-specific" metrics that are not common among a higher architectural layer - those I think can safely be emitted when wrapped with other components. We just don't have a clear separation between the transforms and the code intended for reuse - and this is a bit problematic in the context of metrics.


As a side note, I realized there's no json_parser reuse in the kubernetes_logs source anymore (but there is still, to the very least, a regex parser reuse, so the problem remains).

@ktff
Copy link
Contributor Author

ktff commented Nov 27, 2020

we should prohibit the use of transforms as parts of other components as-is,

Why would that be a problem?

Precisely because of the duplicating emits.

Then we are not just talking of prohibiting use of transforms, but also of sinks and sources. For extreme examples there are new_relic_logs and honeycomb sinks which are just wrappers of http sink. And this reminded me why I've opted to not require component_type == origin, since we would need to change http and sinks like those to expose event to higher components so they could do the emit.

That's why for scenarios of where there are no metrics with component_type == origin and multiple of them, we select max value. It is arbitrary in a sense that it assumes no events are being generated between multiple emit! points, if that holds then all of the counters would be the same or some would we smaller if some filtering happened so max value is probably what we want, but yea, I realized now that it's more arbitrary than I thought.


Talking about this gave me an idea. We can handle metrics events_processed_total and bytes_processed_total in a different way if we uphold and change few things.

  1. A source/sink can wrap only one source/sink respectively. Wrapped source/sink can wrap other source/sink.
  2. For transforms we raise emitting of events_processed_total to topology level.
  3. If source/sink/shared code exposes a way to encode/decode event then it's the responsibility of wrapper to emit those metrics. Otherwise it's their responsibility. Or said in other way, it's the responsibility of that which makes bytes -> event or event -> bytes transition to emit those events.

Second is a simple change, and for first I don't know of any one source/sink that uses more than one, and if such case do arise we can deal with it later. The third point is the problematic one since it has fuzzy borders.

This would achieve that only one events_processed_total and bytes_processed_total emit sites exists in sources and sinks, but also in transforms since they aren't emitting events_processed_total, and there are no transforms that emits bytes_processed_total.

No origin label or messing with metrics is needed, although it's somewhat all over the place.

@MOZGIII what do you think?

@ktff
Copy link
Contributor Author

ktff commented Dec 3, 2020

Ok, so I think the current approach in the PR is solid with its' main problem being the choice of counter when we don't have component_type == origin. For reason that some components don't have access to events so they can't emit their own metrics, we can't require component_type == origin, but if we have only one choice of counter that isn't an issue. So in the current PR the problematic point is this line:
https://github.com/timberio/vector/blob/6a78450f14008dd02aaab8ece773c7f2d97e02a9/src/api/schema/metrics/mod.rs#L294-L295
when we have multiple choices with component_type == origin metrics or with component_type != origin metrics. For that we can ban it, by reporting an error that this should be fixed, additionally we can choose a sensible value until it's fixed, for that I think max is fine.

Plus, to ease the problem we can raise emitting of events_processed_total out of transform implementations to topological level. With that we only need to worry about metrics in sources that wrap other sources, and sinks that wrap other sinks.

@MOZGIII what do you think? Do you see some additional issues?

@MOZGIII
Copy link
Contributor

MOZGIII commented Dec 3, 2020

Plus, to ease the problem we can raise emitting of events_processed_total out of transform implementations to topological level. With that we only need to worry about metrics in sources that wrap other sources, and sinks that wrap other sinks.

I think this was explored already and there were issues with this approach. CC @lukesteensen

Ok, so I think the current approach in the PR is solid with its' main problem being the choice of counter when we don't have component_type == origin. For reason that some components don't have access to events so they can't emit their own metrics, we can't require component_type == origin, but if we have only one choice of counter that isn't an issue. So in the current PR the problematic point is this line:
https://github.com/timberio/vector/blob/6a78450f14008dd02aaab8ece773c7f2d97e02a9/src/api/schema/metrics/mod.rs#L294-L295

when we have multiple choices with component_type == origin metrics or with component_type != origin metrics. For that we can ban it, by reporting an error that this should be fixed, additionally we can choose a sensible value until it's fixed, for that I think max is fine.

Yeeah, for the lack of a better alternative - I guess this is fine for now.

We should probably just wrap the "nested components" invocation with an extra span, adding a new label (for instance wrapped = true). With this we would be able to reliably filter out all wrapped = true labels. Implementing this will, however, require a manual audit to figure out where do we still have nested components - and wrapping the invocations with spans (as in tracing::span!).

@ktff
Copy link
Contributor Author

ktff commented Dec 3, 2020

I think this was explored already and there were issues with this approach. CC @lukesteensen

I guess the inline transforms will be an issue.

Yeeah, for the lack of a better alternative - I guess this is fine for now.

Yep, that's also my stance on this.

We should probably just wrap the "nested components" invocation with an extra span, adding a new label (for instance wrapped = true)

I'll explore if we can somehow automate this.

@ktff
Copy link
Contributor Author

ktff commented Dec 3, 2020

We should probably just wrap the "nested components" invocation with an extra span, adding a new label (for instance wrapped = true)

I'll explore if we can somehow automate this.

It can at least be made pretty ergonomic, so if entering/exiting span for every event for every reused component isn't too costly this seams like the best option. Although I guess inline transforms will have this same problem/cost of spans so it won't be nothing new.

@MOZGIII I agree, if you are for it we can try this approach.

@MOZGIII
Copy link
Contributor

MOZGIII commented Dec 3, 2020

if entering/exiting span for every event for every reused component isn't too costly

This I'm a bit worried about, and it seems like it might be relatively costly based on our other observations. We can try and compare the benches before/after.

Other than that, I think this is the only correct solution on the table so far - so I am for it.

@ktff
Copy link
Contributor Author

ktff commented Dec 6, 2020

Closing in favor of #5424

@ktff ktff closed this Dec 6, 2020
@binarylogic binarylogic deleted the ktff/origin_label branch April 23, 2021 14:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain: observability Anything related to monitoring/observing Vector
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants