Skip to content

Use better indexes for job history table #8573

@mkindahl

Description

@mkindahl

Background

The bgw_job_history table currently have the following indexes:

  • bgw_job_stat_history_pkey is the primary key used to index the record id (which is different from the job id)
  • bgw_job_stat_history_job_id_idx is an index used for queries on the job id (there can be many records with the same job).

Problem

Many queries of interest on the history table will involve the execution start time, for example, to get the last $1 entries for all jobs starting execution after a particular time you can use this query, which involves a qualifier on both job_id and execution_start. This kind of query is common when you want to find last N entries but ignore old jobs (to speed up the query, for example).

select
    id,
    job_id,
    execution_start,
    execution_finish,
    execution_finish - execution_start as execution_time,
    succeeded
from (
    select *,
        row_number() over (partition by job_id order by execution_start desc) as row_number
    from job_stat_history
    where
        job_id = any($2) and execution_start > $3
) t
where row_number <= $1
order by job_id, execution_time;

Other alternatives are a query only having a qualifier on job_id

select
    id,
    job_id,
    execution_start,
    execution_finish,
    execution_finish - execution_start as execution_time,
    succeeded
from (
    select *,
        row_number() over (partition by job_id order by execution_start desc) as row_number
    from job_stat_history
    where
        job_id = any($2)
) t
where row_number <= $1
order by job_id, execution_time;

and a query that has qualifier on only execution_time

select
    id,
    job_id,
    execution_start,
    execution_finish,
    execution_finish - execution_start as execution_time,
    succeeded
from (
    select *,
        row_number() over (partition by job_id order by execution_start desc) as row_number
    from job_stat_history
   where execution_start > $2
) t
where row_number <= $1
order by job_id, execution_time;

Looking for information for a particular job_id is expected to be a common ad-hoc query:

select
    id,
    job_id,
    execution_start,
    execution_finish,
    execution_finish - execution_start as execution_time,
    succeeded
from job_stat_history
where job_id = 4
order by execution_finish desc
limit 5;

Proposal

Drop the bgw_job_stat_history_job_id_idx and recreate one that involves the execution start time to allow efficient queries involving the start time.

The job_id index is not very useful, even when looking for information on a particular job_id, since the number of jobs is not very high and a sequence scan with a filter is about as efficient for this query. An index would allow a Bitmap Heap Scan, but it is marginally faster in tests.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions