Skip to content

Prom-Client aggregation causing master to choke in cluster mode #628

Description

@ssg2526

I implemented a small cluster module based code. I have one master and eight workers running. I am using AggregatorRegistry class like:

const AggregatorRegistry = client.AggregatorRegistry;
const aggregatorRegistry = new AggregatorRegistry();

const externalMethodTimings = new client.Histogram({
    name: 'external',
    labelNames: ['status', 'method', 'url'],
    help: 'None'
});

The worker has one end point which responds with a 30ms latency and also a middleware to increase the number of metrics to simulate the production like behaviour.

let express = require('express');
let app = express();
app.use(function(req, res, next)
{
    req.startTime = new Date().getTime();
    req.custom = req.url+"/"+counter%700
    counter++;
    next();
});

app.use(function(req, res, next){
    var startAt = req.startTime;
    res.on('finish', ()=> {
        try {
        var diff = new Date().getTime() - startAt;
        externalMethodTimings.labels({'method': req.method, 'url': req.custom}).observe(diff/1000.0);
        return;
        } catch(err){
        
        }
    });
    next();
});

app.get('/worker', function(req, res){
    setTimeout(()=>{
        return res.status(200).send('worker Server is running');
    },30);
});

app.listen(port, function () {
    console.log('worker server listening on port %d', port);
});

Now I have a master that has the following code

app.get('/master', async (req, res) => {
    try {
        var start = new Date().getTime();
        const workerMetrics = aggregatorRegistry.clusterMetrics();
		            
        workerMetrics.then(values => {
            var allMetrics = values; 
            console.log(new Date().getTime() - start);
            res.set('Content-Type', aggregatorRegistry.contentType);
            res.send(allMetrics);
        });
    } catch (e) {
	    res.statusCode = 500;
	    res.send(e.message);
    }
});

app.listen(master_port, function(){
    console.log('master server listening on port %d', master_port);
});

When I run this application and put load using apache benchmark using the below configs. Also I ran a watcher to hit the cluster metric endpoint on master at every 1 second to simulate prometheus scraping. The command is given below -

ab -n 60000 -c 50 "http://localhost:8080/worker"

watch -n 1 "curl http://localhost:3000/master > /dev/null";

I get my 95th percentile at about 190ms - 200ms and the 98th and 99th percentiles are even higher on this load. When i stop my watcher and don't hit the metric end point during the load the latencies are in the expected ranges of about 30-40ms at 95th percentile.
I tried commenting the aggregation code in the prom-client and ran the same load it ran fine again. So my understanding till this point is that the aggregation of metrics from different workers are blocking the master when the size of metrics are bigger and as the master gets choked it stops routing the requests to the workers and the workers don't get the requests which causes the latencies reported by the benchmark tool. Another thing to prove the same hypothesis is that if we see the prometheus metrics from the end point the latencies reported by the workers are always in le 50ms bucket which means that once the request reaches the worker, it always gets responded as per the expectation.
Kindly suggest a way to resolve this. Also happy to raise any PR if required

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions