---
title: Report Caching
weight: 12
slug: 0012
extra:
  rfd: 12
  primary_author: Julian Lanson
  primary_author_link: https://github.com/j-lanson
  status: Proposed
---
# Report Caching

In anticipation of users running Hipcheck as a GitHub action against the
dependencies of their own project, we propose adding to Hipcheck the ability to
cache reports. Given the same policy file, Hipcheck binary, target repository,
and ref, Hipcheck is nearly deterministic across runs (a few plugins in our
suite still need updating so that their analyses do not consider activity after
the date/time of the target ref). We can therefore drive down the execution
time of multi-target analyses in resource-constrained environments, such as
Continuous Integration (CI) runners, by allowing Hipcheck to cache the reports
of previous runs and skip redundant execution.
## Report Caching Keys

We have identified the following as the keys that must all match to produce a
report "cache hit":

- **The policy file path and hash.** Not only does the policy file itself need
  to be consistent (indicated by the hash), but it must be in the same
  location relative to the Hipcheck binary. This is because the policy file
  may contain relative paths to config files for various plugins, so moving
  the policy file can have indirect effects on the configuration of plugins,
  and by extension the score a target receives.
- **The Hipcheck binary hash.** Naturally, an `hc` binary that hashes to a
  different value can produce a different report. We choose the binary hash
  over the version string for uniquely identifying the Hipcheck instance to
  help developers, who would otherwise have to manually invalidate the cache
  between runs (during development the code changes but the version does not).
- **The repository and analyzed commit.** The repository is an obvious key,
  and would likely be represented by the repo's path in the Hipcheck target
  cache. We must specifically key on the git commit or tag that was analyzed
  rather than the ref, because refspecs such as `HEAD` can implicitly refer to
  different commits as the remote repository is updated.

One limitation of this keying scheme is that changes to plugin-specific config
files referenced by the policy file ought also to invalidate a cached report.
For example, a policy file using the `mitre/binary` plugin might configure that
plugin to use `Binary.kdl`. If the user changes the content of `Binary.kdl`,
the analysis configuration is technically different, but the policy file path
and hash have not changed. We do not currently have a strategy to address this,
because the policy file by design has no contextual understanding of the
configuration key/value pairs sent to each plugin. We therefore cannot know
which config keys represent files that need to be checked as part of the policy
keying process.

Thus, it will be important for us to clearly document this limitation for
users, and to provide them with CLI tools for invalidating cache entries based
on particular elements of the key tuple `(policy file path, policy file hash,
hc binary hash, repository, commit/tag)`.
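To make the keying concrete, the tuple above could be computed along these lines. This is a minimal illustrative sketch in Python (Hipcheck itself is written in Rust); the function and field names are hypothetical, not part of the Hipcheck codebase:

```python
import hashlib
from pathlib import Path

def file_sha256(path: str) -> str:
    """Hex SHA-256 digest of a file's contents, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def report_cache_key(policy_path: str, hc_binary_path: str,
                     repo_path: str, commit: str) -> tuple:
    """Build the (policy path, policy hash, hc hash, repo, commit) key tuple."""
    return (
        str(Path(policy_path)),       # path matters: relative plugin config paths
        file_sha256(policy_path),     # content hash of the policy file
        file_sha256(hc_binary_path),  # hash of the hc binary itself, not its version
        repo_path,                    # repo's path in the Hipcheck target cache
        commit,                       # concrete commit/tag, never a refspec like HEAD
    )
```

Note that editing the policy file changes its hash and therefore the whole key, which is exactly the cache-invalidation behavior described above; editing a referenced plugin config file does not, which is the stated limitation.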
## Inserting Report Caching into the Analysis Workflow

Checking the report cache for an existing report should occur just after single
target resolution, once we have the concrete commit/tag information for a
`LocalGitRepo` instance. Inserting a report into the cache should occur upon a
successful execution of the `hipcheck/src/main.rs::run()` function in
`cmd_check()`. We anticipate that this guidance will be affected by the changes
proposed in RFD11 to support multi-target seeds. It is uncertain whether the
RFD11 changes or the changes proposed in this RFD will be implemented first,
but regardless of how the structure of Hipcheck core changes, the report cache
should be consulted before we execute the analysis of a single target
repository and updated after we receive a successful report for that analysis.

We will mention this below as well, but given the anticipated use of
parallelization for multi-target seeds, we will need synchronization primitives
to protect the report cache from race conditions.
## Report Cache Design

The report cache should follow the precedent of the existing target and plugin
caches, and thus be stored in a `report/` subdirectory of the directory
identified by the `HC_CACHE` environment variable.

These existing Hipcheck caches have simple keying systems with an implicit
hierarchical structure. For example, a plugin's version is more specific than
its name, which is in turn more specific than its publisher. This has allowed
for a simple implementation using a directory hierarchy as keys, with the
"value" being a file in the lowest subdirectory. The proposed report cache
keying strategy has more elements and no implicit hierarchy; if we were to use
directories as keys, would the policy file hash be parent to the Hipcheck
binary hash, or vice versa? Where would the repository fit in? These are
important questions because, in a directory-based system, invalidating the
cache based on a particular sub-key would require recursively searching the
entire `report/` directory for paths containing that sub-key.

Therefore, we propose using a simple database file, such as a SQLite database,
to store a table containing all the key elements plus the filename of the
report. We would likely store all the reports directly in `report/` as
compressed files, alongside the one database file. We also propose a simple
unique ID column for the table so that entries can be referenced without
specifying every key element. This unique ID would be auto-generated and
assigned by the SQLite database as each entry is created.
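A minimal sketch of such a table, using Python's stdlib `sqlite3` for illustration (the real implementation would live in Hipcheck's Rust codebase, and all column names here are hypothetical):

```python
import sqlite3

# The real database would live under $HC_CACHE/report/; in-memory for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS reports (
        id           INTEGER PRIMARY KEY AUTOINCREMENT,  -- unique ID for CLI reference
        policy_path  TEXT NOT NULL,
        policy_hash  TEXT NOT NULL,
        hc_hash      TEXT NOT NULL,   -- hash of the hc binary
        repo         TEXT NOT NULL,   -- repo path in the target cache
        commit_id    TEXT NOT NULL,   -- concrete commit/tag, not a refspec
        report_file  TEXT NOT NULL,   -- compressed report stored beside the DB
        UNIQUE (policy_path, policy_hash, hc_hash, repo, commit_id)
    )
""")
cur = conn.execute(
    "INSERT INTO reports (policy_path, policy_hash, hc_hash, repo, commit_id,"
    " report_file) VALUES (?, ?, ?, ?, ?, ?)",
    ("policy.kdl", "ab12", "cd34", "repo-cache-path", "deadbeef", "1.json.gz"),
)
print(cur.lastrowid)  # the auto-generated unique ID for this entry
```

The `UNIQUE` constraint enforces the full key tuple, while the auto-generated `id` gives the short handle for referencing entries from the CLI.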
### Report Cache Synchronization

With the implementation of RFD11, it will become more likely for multiple `hc
check` analysis instances to be adding, deleting, and/or reading cache entries
simultaneously. Due to the transaction synchronization logic it offers, using a
SQLite (or comparable) database to manage the cache would help to reduce (but
not entirely eliminate) the possibility of race conditions. The remaining
potential race condition we have identified is between when an entry is created
for a report and when the associated compressed file appears in the `report/`
directory. As an alternative to this split-storage system, the entire
compressed report could be stored as a blob in the database file. [This
page][sqlite-blobs] shows when storing blobs inside the database is effective,
as a function of configured page size and blob size. We should investigate what
size we expect a compressed report JSON file to take up.
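The blob alternative could look like the following sketch (stdlib `sqlite3` and `zlib`; the schema is hypothetical, and whether inline blobs outperform external files depends on the page-size/blob-size tradeoff discussed on the linked SQLite page):

```python
import json
import sqlite3
import zlib

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE reports (
        id     INTEGER PRIMARY KEY AUTOINCREMENT,
        key    TEXT UNIQUE NOT NULL,  -- stand-in for the full key tuple
        report BLOB NOT NULL          -- compressed report stored inline
    )
""")

report = {"target": "example/repo", "score": 0.93, "passing": True}
blob = zlib.compress(json.dumps(report).encode())

# One transaction: the cache entry and its report body become visible
# atomically, closing the race window between row creation and file creation
# that exists in the split-storage design.
with conn:
    conn.execute("INSERT INTO reports (key, report) VALUES (?, ?)", ("k1", blob))

row = conn.execute("SELECT report FROM reports WHERE key = ?", ("k1",)).fetchone()
assert json.loads(zlib.decompress(row[0])) == report
```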
## Report Cache CLI

As with the target and plugin caches, we propose offering users a subcommand
for interfacing with the report cache, namely `hc cache report`. Users are most
likely to want to clear the cache, set a different cache eviction policy, or
invalidate all entries that match a certain subkey (e.g., invalidate all
entries for a given policy file). We propose the following subcommand
semantics:

```
// Invalidate all (request user confirmation)
hc cache report delete --all

// Shortcut for `delete --all`
hc cache report clear

// Invalidate all (don't request user confirmation)
hc cache report delete --all -y

// Invalidate all entries keyed by this hc binary.
// If no <HC_PATH> supplied, defaults to the current
// `hc` binary.
hc cache report delete --hc [<HC_PATH>]

// Invalidate all entries keyed by this policy file
hc cache report delete --policy <POLICY_PATH>

// Invalidate all entries keyed by this target; use the
// target resolution system to resolve to a repo/ref
hc cache report delete --target <TARGET> [--ref <REFSPEC>]
```
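With a database-backed cache, each subkey-based invalidation reduces to a single `DELETE` statement rather than a recursive directory search. A sketch, again using stdlib `sqlite3` with hypothetical column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reports (id INTEGER PRIMARY KEY,"
    " policy_hash TEXT, hc_hash TEXT, repo TEXT)"
)
conn.executemany(
    "INSERT INTO reports (policy_hash, hc_hash, repo) VALUES (?, ?, ?)",
    [("p1", "h1", "r1"), ("p1", "h2", "r2"), ("p2", "h1", "r1")],
)

def delete_by_policy(conn, policy_hash):
    """Sketch of `delete --policy`: resolve the policy file to its hash,
    then drop every entry keyed by it. Returns the number invalidated."""
    with conn:  # run as one transaction
        return conn.execute(
            "DELETE FROM reports WHERE policy_hash = ?", (policy_hash,)
        ).rowcount

print(delete_by_policy(conn, "p1"))  # 2: both entries for that policy hash
```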
### Reviewing Reports Marked For Investigation

We anticipate a CI workflow where users run Hipcheck against their project
dependencies and mark the job as failing if any dependencies are marked for
investigation. We expect the user will then do their own investigation of the
marked dependency, after which they may either decide to eliminate or
substitute that dependency in their project, or determine that the dependency
is not a risk. In the latter case, the user will wish to allow the CI to
continue without failing for this dependency.

Thus, we propose a `hc cache report reviewed` command that takes the policy
file and target information as `hc check` does, but updates the `report/`
database to indicate that the failed report has been reviewed and should no
longer cause an alert. This would constitute an additional boolean column in
the SQLite database file for each entry. The benefit of tying the "reviewed"
status to a particular cache entry is that if any of the key elements change
(different `hc` version, policy file, or repo commit), the "reviewed" status
for the old key no longer applies, and users will be asked to re-review.

The proposed CLI for this command is as follows:

```
// Mark the cache entry with policy <POLICY_PATH>, target <TARGET> as reviewed.
// Optionally specify <HC_PATH>, otherwise defaults to the current `hc` binary.
hc cache report reviewed --policy <POLICY_PATH> --target <TARGET> [--ref
<REFSPEC>] [--hc <HC_PATH>]

// Mark the report with cache row unique ID <UNIQUE_ID> as reviewed
hc cache report reviewed --id <UNIQUE_ID>
```
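Internally, marking an entry reviewed by unique ID would reduce to a single `UPDATE` on the proposed boolean column. A self-contained sketch (stdlib `sqlite3`, hypothetical schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reports (id INTEGER PRIMARY KEY,"
    " target TEXT, reviewed INTEGER NOT NULL DEFAULT 0)"
)
conn.execute("INSERT INTO reports (target) VALUES ('example/repo')")

def mark_reviewed(conn, unique_id):
    """Sketch of `reviewed --id`: flip the reviewed flag for one cache entry.
    Returns False if no entry has that ID, so the CLI can report the miss."""
    with conn:
        cur = conn.execute(
            "UPDATE reports SET reviewed = 1 WHERE id = ?", (unique_id,)
        )
    return cur.rowcount == 1

assert mark_reviewed(conn, 1)       # entry exists, now reviewed
assert not mark_reviewed(conn, 99)  # unknown ID, nothing updated
```

Because the flag lives on the cache row, any key change produces a fresh row with `reviewed` defaulting to 0, giving the re-review behavior described above.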
Pseudo-code for how the "reviewed" status influences the control flow is as
follows:

```
if report for target exists in cache:
    if reviewed:
        no alert
        return
    else:
        score report
        if failed investigate policy:
            mark needs investigation
else:
    generate report
    cache report
    score report
    if failed investigate policy:
        mark needs investigation
```
### Report Cache Functionality in CI

One rather obvious deficiency in the design proposed so far is that users
running Hipcheck as a CI action do not have easy access to a terminal for
running CLI commands, and so cannot readily mark dependencies as reviewed to
allow the CI to move forward. We describe a solution here, and take this
opportunity to sketch out some broader aspects of Hipcheck in CI.

We propose that Hipcheck act as a GitHub CI action that fails if any provided
targets need investigation. This means that, provided subsequent workflow steps
aren't explicitly configured to run in spite of a previous failed step, the
entire job will also fail. The failure message will report to users which
dependencies/targets need investigation, along with each dependency's unique
ID.

Once the user decides a dependency is acceptable, they will use a separate
manual CI action to manipulate the cache. This Hipcheck review action will take
as input a comma-separated list of dependencies as a string. The review action
will then split the dependency list into separate calls to `hc cache report
reviewed --id <ID>` to mark the specific dependencies as reviewed.

The key element to making this approach work is that the main check workflow
and the manual review workflow need to pull from the same `HC_CACHE`, so that
the unique IDs match and "reviewed" markings written to the report database by
the manual review workflow are reflected in the next `hc check` run. According
to the GitHub [actions documentation][workspace-docs], the path specified by
the `GITHUB_WORKSPACE` environment variable is consistent between runs, in
which case we can point all Hipcheck actions to the same location in this
directory with the `HC_CACHE` variable.

If this strategy does not work, we propose that both actions use the
`actions/cache@v3` [action][cache-action] with a common key that saves and
loads the contents of the designated cache directory, and ensure that
`HC_CACHE` is set to that cache when `hc` is run in both actions.

If users are not interested in this stricter workflow, they can mark the check
action as `continue-on-error` so that an "investigate" determination on any
number of dependencies does not cause the whole job to fail.
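A manually-triggered review workflow along these lines could drive the `reviewed` markings. This is a hypothetical sketch only: the cache path, cache key, and the assumption that `hc` is already on the runner's `PATH` are all illustrative, not an existing published action:

```yaml
name: Mark Hipcheck reports reviewed
on:
  workflow_dispatch:
    inputs:
      ids:
        description: "Comma-separated report cache unique IDs"
        required: true

jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      # Restore the same cache the check workflow uses, so unique IDs match.
      - uses: actions/cache@v3
        with:
          path: .hc-cache
          key: hipcheck-report-cache
      - name: Mark entries reviewed
        env:
          HC_CACHE: ${{ github.workspace }}/.hc-cache
        run: |
          IFS=',' read -ra IDS <<< "${{ inputs.ids }}"
          for id in "${IDS[@]}"; do
            hc cache report reviewed --id "$id"
          done
```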
[sqlite-blobs]: https://www.sqlite.org/intern-v-extern-blob.html
[workspace-docs]: https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/store-information-in-variables#default-environment-variables
[cache-action]: https://github.com/marketplace/actions/cache
