Provide real-time service metrics #1874

rkojedzinszky · 2025-07-02T19:20:39Z

Service metrics are now collected at scrape time rather than being prepopulated. This allows more frequent collection, which is no longer bound to the service synchronization period. For example, on a PI cluster, CPU cycles can be saved by using longer synchronization intervals, while metrics can still be collected in real time.
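
The idea of collecting at scrape time can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not kube-router's implementation: `statsSource`, `metricsHandler`, `scrape`, and the metric name `service_packets_in` are hypothetical, and the PR itself works through the Prometheus client library's collector interface (the `Describe` method visible in the diff) rather than a hand-written handler.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// statsSource is a hypothetical stand-in for reading live per-service counters.
type statsSource func() map[string]uint64

// metricsHandler reads the counters on every request, so the exported values
// are as fresh as the scrape itself and do not depend on any sync period.
func metricsHandler(read statsSource) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		for svc, pkts := range read() {
			fmt.Fprintf(w, "service_packets_in{service=%q} %d\n", svc, pkts)
		}
	})
}

// scrape performs one in-process request against the handler and returns the body.
func scrape(h http.Handler) string {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	return rec.Body.String()
}

func main() {
	var packets uint64
	h := metricsHandler(func() map[string]uint64 {
		packets += 10 // pretend traffic arrived between scrapes
		return map[string]uint64{"default/web": packets}
	})
	fmt.Print(scrape(h)) // first scrape
	fmt.Print(scrape(h)) // second scrape reports a newer value
}
```

Because the counters are read inside the handler, lengthening the sync interval changes nothing about how fresh each scrape is.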

rkojedzinszky · 2025-07-03T09:41:16Z

I was wondering whether Linux namespace switches would collide with statistics collection in different threads. However, I found that the existing code already takes care of it here and here. Am I right?
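
For readers unfamiliar with the concern: a network namespace is per-thread state, and the Go scheduler may move goroutines between OS threads. One common guard, shown here as a generic sketch and not as a description of what the linked kube-router code does, is to pin the goroutine to its thread for the duration of the switch. `runPinned` is a hypothetical helper.

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
)

// runPinned runs fn on a goroutine that stays locked to one OS thread for the
// whole call. Per-thread state changed inside fn (a network namespace, for
// example) is then not visible to goroutines running on other threads.
// fn must restore the original state before returning, because the thread is
// handed back to the scheduler once the goroutine unlocks it.
func runPinned(fn func() error) error {
	done := make(chan error, 1)
	go func() {
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()
		done <- fn()
	}()
	return <-done
}

func main() {
	err := runPinned(func() error {
		// A real caller would enter a namespace, read counters, and switch back here.
		return errors.New("no namespace work in this sketch")
	})
	fmt.Println(err)
}
```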

aauren · 2025-07-06T19:31:34Z

Thanks for submitting this @rkojedzinszky. I'm always excited to see people from the community that want to make kube-router better!

As a small note for next time, typically with larger change sets like this, it is best to open an issue ahead of time with the project just to ensure that we don't waste your development time on a feature that might not be accepted by the project.

The issue also helps because the template usually guides users through the process of describing the business value of the change a little better. In this case, what is the real-world problem that you're trying to solve?

Is it fair to say that you're seeing an adverse impact on minimal systems from IPVS metric collection when it is attached to the service update timer? Are you able to quantify this? What type of intervals do you use after this update is made? How has it impacted your cluster?

rkojedzinszky · 2025-07-07T07:21:11Z

Closes #1875

rkojedzinszky · 2025-07-07T07:23:24Z

@aauren I really appreciate your attention. I've created an issue for this. Sorry for being hasty.

aauren

Thanks again for this PR @rkojedzinszky!

I really appreciate the thought and time that you put into this PR and I agree with the use-case. Just because people have long (or short) service sync times doesn't mean that it should change the granularity of the metrics produced. I really like this approach of producing the metrics just in time.

Most of the review (other than the request to pull the metric logic out of the network_services_controller.go file) is questions and genuine requests for your thoughts. Let me know what you think!

pkg/controllers/proxy/network_services_controller.go

aauren · 2025-07-20T18:42:46Z

pkg/controllers/proxy/network_services_controller.go

@@ -725,7 +720,21 @@ func (nsc *NetworkServicesController) syncIpvsFirewall() error {
 	return nil
 }

-func (nsc *NetworkServicesController) publishMetrics(serviceInfoMap serviceInfoMap) error {
+func (*NetworkServicesController) Describe(ch chan<- *prometheus.Desc) {


Over time I've been trying to decrease the size of the controller files. Let's go ahead and move this out of this file and into a dedicated metrics.go file inside the controllers/proxy package to separate the concerns a bit and reduce the complexity of this file.

rkojedzinszky · 2025-07-21

Sure, I will.

aauren · 2025-07-20T18:44:46Z

pkg/controllers/proxy/network_services_controller.go


-		protocol = convertSvcProtoToSysCallProto(svc.protocol)
-		for _, ipvsSvc := range ipvsSvcs {
+	serviceMap := map[svcMapKey]*serviceInfo{}


I like what you did here with the map instead of having the embedded for loops!

However, this Collect() function is still a bit large. Let's go ahead and separate this serviceMap building into its own function. We could potentially separate out the second for loop into its own function as well and then just call both from Collect().

rkojedzinszky · 2025-07-21

Sure, I will.
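
The refactoring discussed in this thread (index the services once, then make a single pass over the IPVS entries) can be sketched as two small functions. All types and names below are simplified assumptions for illustration; they are not the definitions used in kube-router.

```go
package main

import "fmt"

// svcKey is a hypothetical lookup key: the address tuple that identifies a service.
type svcKey struct {
	ip       string
	protocol uint16
	port     uint16
}

// service is a simplified stand-in for the controller's per-service record.
type service struct {
	namespace, name string
	clusterIP       string
	protocol, port  uint16
}

// ipvsService is a simplified stand-in for one entry read from IPVS.
type ipvsService struct {
	key     svcKey
	packets uint64
}

// buildServiceMap indexes the known services by address tuple, once.
func buildServiceMap(svcs []service) map[svcKey]*service {
	m := make(map[svcKey]*service, len(svcs))
	for i := range svcs {
		s := &svcs[i]
		m[svcKey{s.clusterIP, s.protocol, s.port}] = s
	}
	return m
}

// collect walks the IPVS entries once and uses a map lookup where the older
// code used an inner loop. Entries without a matching service are skipped.
func collect(svcMap map[svcKey]*service, entries []ipvsService) []string {
	var out []string
	for _, e := range entries {
		if s, ok := svcMap[e.key]; ok {
			out = append(out, fmt.Sprintf("%s/%s %d", s.namespace, s.name, e.packets))
		}
	}
	return out
}

func main() {
	svcMap := buildServiceMap([]service{
		{namespace: "default", name: "web", clusterIP: "10.96.0.10", protocol: 6, port: 80},
	})
	entries := []ipvsService{{key: svcKey{"10.96.0.10", 6, 80}, packets: 42}}
	fmt.Println(collect(svcMap, entries))
}
```

With n services and m IPVS entries, this does roughly n + m steps instead of n × m comparisons.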

pkg/controllers/proxy/service_endpoints_sync.go

aauren · 2025-07-20T18:51:30Z

pkg/controllers/proxy/network_services_controller.go

@@ -115,7 +118,7 @@ type NetworkServicesController struct {
 	krNode              utils.NodeAware
 	syncPeriod          time.Duration
 	mu                  sync.Mutex
-	serviceMap          serviceInfoMap
+	serviceMap          unsafe.Pointer


This comment is semi related to the request to move metrics out of this file. If possible, I would really like to remove the usage of unsafe pointers in this design.

After thinking about this a bit, what would you think of a pattern that looks more like the following?

    type MetricsCollector struct {
    	serviceMapChan chan serviceInfoMap
    	metricsChan    chan MetricsRequest
    }

    type MetricsRequest struct {
    	ServiceName string
    	Response    chan ServiceMetrics
    }

    type ServiceMetrics struct {
    	ServiceName string
    	Connections int
    	Packets     int
    	// ... other metrics
    }

    func NewMetricsCollector() *MetricsCollector {
    	mc := &MetricsCollector{
    		serviceMapChan: make(chan serviceInfoMap, 1),
    		metricsChan:    make(chan MetricsRequest),
    	}
    	go mc.metricsManager()
    	return mc
    }

    func (mc *MetricsCollector) metricsManager() {
    	var currentMap serviceInfoMap
    	for {
    		select {
    		case newMap := <-mc.serviceMapChan:
    			currentMap = newMap
    		case req := <-mc.metricsChan:
    			// Calculate metrics for the specific service
    			if service, exists := currentMap[req.ServiceName]; exists {
    				metrics := mc.calculateMetrics(service)
    				req.Response <- metrics
    			} else {
    				req.Response <- ServiceMetrics{ServiceName: req.ServiceName}
    			}
    		}
    	}
    }

    func (mc *MetricsCollector) calculateMetrics(service *serviceInfo) ServiceMetrics {
    	// Implement IPVS metrics calculation here
    	return ServiceMetrics{
    		ServiceName: service.name,
    		Connections: mc.getIPVSConnections(service),
    		Packets:     mc.getIPVSPackets(service),
    	}
    }

    func (mc *MetricsCollector) GetServiceMetrics(serviceName string) ServiceMetrics {
    	responseChan := make(chan ServiceMetrics, 1)
    	mc.metricsChan <- MetricsRequest{
    		ServiceName: serviceName,
    		Response:    responseChan,
    	}
    	return <-responseChan
    }

    func (mc *MetricsCollector) UpdateServiceMap(serviceMap serviceInfoMap) {
    	mc.serviceMapChan <- serviceMap
    }

This is just some pseudo-code, so don't take it as the end all be all approach to implementing a metric collector. Mostly just trying to show an approach that uses channels over atomic references.

That keeps the type safety of the serviceInfoMap, eliminates the possibility of a dangling pointer reference (even if that possibility is slim in your implementation), and is more idiomatic and testable.

Generally curious on your thoughts for this approach though.

rkojedzinszky · 2025-07-21

The purpose of the unsafe pointer here was simply to make the pointer read/write operations atomic and avoid using a mutex. For readability, a simple mutex and the real pointer could of course be used instead. I find your proposal too complex here; I did not catch the point.

aauren · 2025-07-30

I personally feel that using channels is just more idiomatic and understandable for future use-cases.

But it's fine for now. If you don't mind, just add a comment above the unsafe pointer declarations with this info in it for now. We can see how it goes and refactor later if needed.
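
A middle ground between the two positions in this thread is the generic `atomic.Pointer[T]` from `sync/atomic` (Go 1.19 and later), which keeps the lock-free read/write while restoring type safety. A later commit in this PR is titled "replace unsafe.Pointer with atomic.Pointer"; the sketch below is only an assumption about what such a change might look like, with a hypothetical `controller` type and simplified `serviceInfo` and `serviceInfoMap` definitions.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Simplified stand-ins for the real types.
type serviceInfo struct{ name string }
type serviceInfoMap map[string]*serviceInfo

// controller is a hypothetical stand-in for NetworkServicesController.
type controller struct {
	// serviceMap holds the most recent snapshot. The sync loop replaces the
	// whole map and never mutates a published one, so the collector can read
	// it without taking a mutex.
	serviceMap atomic.Pointer[serviceInfoMap]
}

// publish atomically swaps in a new snapshot (called by the sync loop).
func (c *controller) publish(m serviceInfoMap) { c.serviceMap.Store(&m) }

// snapshot returns the current map, or nil before the first publish
// (called by the metrics collector).
func (c *controller) snapshot() serviceInfoMap {
	if p := c.serviceMap.Load(); p != nil {
		return *p
	}
	return nil
}

func main() {
	c := &controller{}
	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // concurrent reader, standing in for the collector
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			_ = len(c.snapshot())
		}
	}()
	for i := 0; i < 1000; i++ { // writer, standing in for the sync loop
		c.publish(serviceInfoMap{"default/web": {name: "web"}})
	}
	wg.Wait()
	fmt.Println(len(c.snapshot()))
}
```

This only stays safe as long as published maps are treated as immutable; a design that edits the map in place would still need a mutex or the channel-based manager proposed above.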

aauren · 2025-07-30T15:53:23Z

@rkojedzinszky I think that we're almost there with this one. We just need to separate the logic out into a metrics.go file, break up the collect function a bit, and then add a comment above the unsafe pointer declaration.

Let me know if you run into any time constraints and I can make the changes as well. I think this will be a good one to get in for the rest of the community.

rkojedzinszky · 2025-07-30T18:17:53Z

I am a bit busy now; I'll be able to work on that in the next two weeks.

aauren · 2025-07-30T22:34:41Z

Sounds good!

rkojedzinszky force-pushed the instant-service-metrics branch from 899ca46 to 9dfa6dd Compare July 2, 2025 20:44

feat(nsc): prepare serviceMap to be accessed by collector thread

68a9710

rkojedzinszky force-pushed the instant-service-metrics branch from 9dfa6dd to ddf8c80 Compare July 2, 2025 21:21

rkojedzinszky marked this pull request as draft July 3, 2025 07:46

rkojedzinszky marked this pull request as ready for review July 3, 2025 17:40

rkojedzinszky marked this pull request as draft July 4, 2025 18:38

rkojedzinszky added 3 commits July 4, 2025 21:32

feat(nsc): collect service statistics on demand

418178e

feat(nsc): eliminate nested loops in Collect()

cbc2d80

feat(nsc): improve Service statistics

cb4594b

rkojedzinszky force-pushed the instant-service-metrics branch from ddf8c80 to cb4594b Compare July 4, 2025 19:37

rkojedzinszky marked this pull request as ready for review July 4, 2025 19:44

rkojedzinszky force-pushed the instant-service-metrics branch from 6699874 to 6511070 Compare July 6, 2025 07:12

feat(nsc): optimize key in temporary serviceMap

ceb14d1

rkojedzinszky force-pushed the instant-service-metrics branch from 6511070 to ceb14d1 Compare July 6, 2025 07:15

rkojedzinszky marked this pull request as draft July 7, 2025 07:07

aauren requested changes Jul 20, 2025

feat(nsc): move metrics logic to separate file

24c1196

rkojedzinszky requested a review from aauren August 10, 2025 08:33

rkojedzinszky marked this pull request as ready for review August 10, 2025 08:34

rkojedzinszky added 3 commits August 10, 2025 10:36

feat(nsc): move part of Collect() to getMetricsServiceMap()

94eee6d

feat(nsc): replace unsafe.Pointer with atomic.Pointer

eae4ea1

feat(nsc): getMetricsServiceMap() rebuilds only after services changed

4102dd9

rkojedzinszky force-pushed the instant-service-metrics branch from 2b8a15f to 4102dd9 Compare August 10, 2025 08:36
