Skip to content

Provide real-time service metrics #1874

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

rkojedzinszky
Copy link
Contributor

Service metrics are collected during scraping now, rather than prepopulating them. This allows more frequent collection, and not bound to service synchronization period anymore. For example, on a PI cluster, CPU cycles can be saved with higher synchronization intervals, howewer, metrics still can be collected real-time.

@rkojedzinszky rkojedzinszky force-pushed the instant-service-metrics branch from 899ca46 to 9dfa6dd Compare July 2, 2025 20:44
@rkojedzinszky rkojedzinszky force-pushed the instant-service-metrics branch from 9dfa6dd to ddf8c80 Compare July 2, 2025 21:21
@rkojedzinszky rkojedzinszky marked this pull request as draft July 3, 2025 07:46
@rkojedzinszky
Copy link
Contributor Author

I was wondering if linux namespace switches will collide with statsitics collection in different threads. Howewer, I found that existing code already takes care of it here and here. Am I right?

@rkojedzinszky rkojedzinszky marked this pull request as ready for review July 3, 2025 17:40
@rkojedzinszky rkojedzinszky marked this pull request as draft July 4, 2025 18:38
@rkojedzinszky rkojedzinszky force-pushed the instant-service-metrics branch from ddf8c80 to cb4594b Compare July 4, 2025 19:37
@rkojedzinszky rkojedzinszky marked this pull request as ready for review July 4, 2025 19:44
@rkojedzinszky rkojedzinszky force-pushed the instant-service-metrics branch from 6699874 to 6511070 Compare July 6, 2025 07:12
@rkojedzinszky rkojedzinszky force-pushed the instant-service-metrics branch from 6511070 to ceb14d1 Compare July 6, 2025 07:15
@aauren
Copy link
Collaborator

aauren commented Jul 6, 2025

Thanks for submitting this @rkojedzinszky. I'm always excited to see people from the community that want to make kube-router better!

As a small note for next time, typically with larger change sets like this, it is best to open an issue ahead of time with the project just to ensure that we don't waste your development time on a feature that might not be accepted by the project.

The issue also helps because the template usually guides users through the process of describing the business value of the change a little better. In this case, what is the real-world problem that you're trying to solve?

Is it fair to say that you're seeing an adverse impact on minimal systems from IPVS metric collection when it is attached to the service update timer? Are you able to quantify this? What type of intervals do you use after this update is made? How has it impacted your cluster?

@rkojedzinszky rkojedzinszky marked this pull request as draft July 7, 2025 07:07
@rkojedzinszky
Copy link
Contributor Author

Closes #1875

@rkojedzinszky
Copy link
Contributor Author

@aauren I really appreciate your attention, I've created an issue for this. Sorry for me being hasty.

Copy link
Collaborator

@aauren aauren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for this PR @rkojedzinszky!

I really appreciate the thought and time that you put into this PR and I agree with the use-case. Just because people have long (or short) service sync times, doesn't mean that it should change the granularity of the metrics produced. I really like this approach to making the metrics produce just in time.

A lot of the review (other than requesting that metric logic be pulled out of the network_services_controller.go file, is mostly questions and genuine query for your thoughts. Let me know what you think!

@@ -725,7 +720,21 @@ func (nsc *NetworkServicesController) syncIpvsFirewall() error {
return nil
}

func (nsc *NetworkServicesController) publishMetrics(serviceInfoMap serviceInfoMap) error {
func (*NetworkServicesController) Describe(ch chan<- *prometheus.Desc) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Over time I've been trying to decrease the size of the controller files. Lets go ahead and move this out of this file and into a dedicated metrics.go file inside the controllers/proxy package to separate the concerns a bit and reduce the complexity of this file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I will.


protocol = convertSvcProtoToSysCallProto(svc.protocol)
for _, ipvsSvc := range ipvsSvcs {
serviceMap := map[svcMapKey]*serviceInfo{}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like what you did here with the map instead of having the embeded for loops!

However, this Collect() function is still a bit large. Let's go ahead and separate this serviceMap building into its own function. We could potentially separate out the second for loop into its own function as well and then just call both from Collect().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I will.

@@ -115,7 +118,7 @@ type NetworkServicesController struct {
krNode utils.NodeAware
syncPeriod time.Duration
mu sync.Mutex
serviceMap serviceInfoMap
serviceMap unsafe.Pointer
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is semi related to the request to move metrics out of this file. If possible, I would really like to remove the usage of unsafe pointers in this design.

After thinking about this a bit, what would you think of a pattern that looks more like the following?

type MetricsCollector struct {
    serviceMapChan chan serviceInfoMap
    metricsChan    chan MetricsRequest
}

type MetricsRequest struct {
    ServiceName string
    Response    chan ServiceMetrics
}

type ServiceMetrics struct {
    ServiceName string
    Connections int
    Packets     int
    // ... other metrics
}

func NewMetricsCollector() *MetricsCollector {
    mc := &MetricsCollector{
        serviceMapChan: make(chan serviceInfoMap, 1),
        metricsChan:    make(chan MetricsRequest),
    }
    
    go mc.metricsManager()
    return mc
}

func (mc *MetricsCollector) metricsManager() {
    var currentMap serviceInfoMap
    
    for {
        select {
        case newMap := <-mc.serviceMapChan:
            currentMap = newMap
            
        case req := <-mc.metricsChan:
            // Calculate metrics for the specific service
            if service, exists := currentMap[req.ServiceName]; exists {
                metrics := mc.calculateMetrics(service)
                req.Response <- metrics
            } else {
                req.Response <- ServiceMetrics{ServiceName: req.ServiceName}
            }
        }
    }
}

func (mc *MetricsCollector) calculateMetrics(service *serviceInfo) ServiceMetrics {
    // Implement IPVS metrics calculation here
    return ServiceMetrics{
        ServiceName: service.name,
        Connections: mc.getIPVSConnections(service),
        Packets:     mc.getIPVSPackets(service),
    }
}

func (mc *MetricsCollector) GetServiceMetrics(serviceName string) ServiceMetrics {
    responseChan := make(chan ServiceMetrics, 1)
    mc.metricsChan <- MetricsRequest{
        ServiceName: serviceName,
        Response:    responseChan,
    }
    return <-responseChan
}

func (mc *MetricsCollector) UpdateServiceMap(serviceMap serviceInfoMap) {
    mc.serviceMapChan <- serviceMap
}

This is just some pseudo-code, so don't take it as the end all be all approach to implementing a metric collector. Mostly just trying to show an approach that uses channels over atomic references.

That keeps the type safety of the serviceInfoMap, eliminates the possibility of a dangling pointer reference (even if that possibility is slim in your implementation), and is more idiomatic and testable.

Generally curious on your thoughts for this approach though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here the purpose of the unsafe pointer was to simply make the pointer read/write operations atomic, and avoid using a mutex. For readability, of course a simple mutex and the real pointer could be used. I see you proposal too complex here, did not catch the point.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally feel that using channels is just more idiomatic and understandable for future use-cases.

But its fine for now. If you don't mind, just add a comment above the unsafe pointer declarations with this info in it for now. We can see how it goes and refactor later if needed.

@aauren
Copy link
Collaborator

aauren commented Jul 30, 2025

@rkojedzinszky I think that we're almost there with this one. We just need to separate the logic out into a metrics.go file, break up the collect function a bit, and then add a comment above the unsafe pointer declaration.

Let me know if you run into any time constraints and I can make the changes as well. I think this will be a good one to get in for the rest of the community.

@rkojedzinszky
Copy link
Contributor Author

@rkojedzinszky I think that we're almost there with this one. We just need to separate the logic out into a metrics.go file, break up the collect function a bit, and then add a comment above the unsafe pointer declaration.

Let me know if you run into any time constraints and I can make the changes as well. I think this will be a good one to get in for the rest of the community.

I am bit busy now, I'll be able to work on that in the next two weeks.

@aauren
Copy link
Collaborator

aauren commented Jul 30, 2025

@rkojedzinszky I think that we're almost there with this one. We just need to separate the logic out into a metrics.go file, break up the collect function a bit, and then add a comment above the unsafe pointer declaration.
Let me know if you run into any time constraints and I can make the changes as well. I think this will be a good one to get in for the rest of the community.

I am bit busy now, I'll be able to work on that in the next two weeks.

Sounds good!

@rkojedzinszky rkojedzinszky requested a review from aauren August 10, 2025 08:33
@rkojedzinszky rkojedzinszky marked this pull request as ready for review August 10, 2025 08:34
@rkojedzinszky rkojedzinszky force-pushed the instant-service-metrics branch from 2b8a15f to 4102dd9 Compare August 10, 2025 08:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants