---
title: Report Caching
weight: 12
slug: 0012
extra:
  rfd: 12
  primary_author: Julian Lanson
  primary_author_link: https://github.com/j-lanson
  status: Proposed
---

# Report Caching

In anticipation of users running Hipcheck as a GitHub Action against the
dependencies of their own project, we propose adding to Hipcheck the ability to
cache reports. Given the same policy file, Hipcheck binary, target repository,
and ref, Hipcheck is nearly deterministic across runs (a few plugins in our
suite still need updating to restrict their analyses to data no newer than the
date of the analyzed commit). We can therefore help drive down the execution
time of multi-target analyses in resource-constrained environments, such as
Continuous Integration (CI) runners, by allowing Hipcheck to cache the reports
of previous runs and skip redundant execution.

## Report Caching Keys

We have identified the following keys that must all match to have a report
"cache hit":
- **The policy file path and hash.** Not only does the policy file itself need
  to be consistent (indicated by the hash), but it must be in the same
  location relative to the Hipcheck binary. This is because the policy file
  may contain relative paths to config files for various plugins, so moving
  the policy file can have indirect effects on the configuration of plugins,
  and by extension the score a target receives.
- **The Hipcheck binary hash.** Naturally, an `hc` binary that hashes to a
  different value can produce a different report. We choose the binary hash
  over the version string for uniquely identifying the Hipcheck instance to
  help developers, who would otherwise have to manually invalidate the cache
  between runs (since during development the code changes but the version does
  not).
- **The repository and analyzed commit.** The repository is an obvious key,
  and would likely be represented by the repo's path in the Hipcheck target
  cache. We must specifically key on the git commit or tag that was analyzed
  rather than the ref, because refspecs such as `HEAD` can implicitly refer to
  different commits as the remote repository is updated.

One limitation of this keying scheme is that changes to plugin-specific config
files referenced by the policy file ought also to invalidate a cached report.
For example, a policy file using the `mitre/binary` plugin might configure that
plugin to use `Binary.kdl`. If the user changes the content of `Binary.kdl`,
the analysis configuration is technically different, but the policy file path
and hash have not changed. We do not currently have a strategy to address this,
because the policy file by design has no contextual understanding of the
configuration key/value pairs sent to each plugin. We therefore cannot know
which config keys represent files that need to be checked as part of the policy
keying process.

Thus, it will be important for us to clearly document this limitation for
users, and to provide them with CLI tools for invalidating cache entries based
on particular elements of the key tuple `(policy file path, policy file hash,
hc binary hash, repository, commit/tag)`.
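To make the keying concrete, here is a minimal sketch of how the hash elements
of the key tuple could be computed (Python for illustration; Hipcheck itself is
written in Rust, and the function names here are hypothetical, not part of the
Hipcheck codebase):

```python
import hashlib
from pathlib import Path

def file_sha256(path: str) -> str:
    """Hash a file's contents; used for both the policy file and the hc binary."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large binaries don't need to fit in memory at once.
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def report_cache_key(policy_path: str, hc_path: str, repo: str, commit: str) -> tuple:
    """Assemble the full key tuple:
    (policy file path, policy file hash, hc binary hash, repository, commit/tag)."""
    return (
        str(Path(policy_path)),
        file_sha256(policy_path),
        file_sha256(hc_path),
        repo,
        commit,
    )
```

Hashing the binary rather than reading its version string means a rebuilt `hc`
with an unchanged version number still produces a distinct key, per the
rationale above.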

## Inserting Report Caching into the Analysis Workflow

Checking the report cache for an existing report should occur just after single
target resolution, at which point we have the concrete commit/tag information
for a `LocalGitRepo` instance. Inserting a report into the cache should occur
upon a successful execution of the `hipcheck/src/main.rs::run()` function in
`cmd_check()`. We anticipate that this guidance will be impacted by the changes
proposed by RFD11 to support multi-target seeds. It is uncertain whether the
RFD11 changes or the changes proposed in this RFD will be implemented first,
but regardless of how the structure of Hipcheck core changes, the report cache
should be checked before we execute the analysis of a single target repository
and updated after we receive a successful report for that analysis.

We will mention this below as well, but given the anticipated use of
parallelization for multi-target seeds, we will need synchronization primitives
to protect the report cache from race conditions.

## Report Cache Design

The report cache should follow the precedent of the existing target and plugin
caches, and thus be stored in a `report/` subdirectory of the directory
identified by the `HC_CACHE` environment variable.

These existing Hipcheck caches have simple keying systems with an implicit
hierarchical structure. For example, a plugin's version is more specific than
its name, which is in turn more specific than its publisher. This has allowed
for a simple implementation that uses a directory hierarchy as the keys, with
the "value" being a file in the lowest subdirectory. The proposed report cache
keying strategy has more elements and no implicit hierarchy; if we were to use
directories as keys, would the policy file hash be parent to the Hipcheck
binary hash or vice versa? Where would the repository fit in? These questions
matter because, under a directory-based system, invalidating the cache based on
a particular sub-key would require recursively searching the entire `report/`
directory for paths containing that sub-key.

Therefore, we propose using a simple database file, such as a SQLite database,
to store a table containing all the key elements plus the filename of the
report. We would likely store all the reports directly in `report/` as
compressed files, alongside the one database file. We also propose a simple
unique ID column for the table so that entries can be referenced without
specifying every key element. This unique ID would be auto-generated and
assigned by the SQLite database as each entry is created.
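As a sketch, the proposed table could look like the following (illustrative
Python using the standard-library `sqlite3` module; all column names are
assumptions, not a finalized schema). In SQLite, an `INTEGER PRIMARY KEY`
column is auto-assigned on insert, which gives us the proposed unique ID for
free:

```python
import sqlite3

# Illustrative schema; column names are assumptions, not a finalized design.
# The `reviewed` column supports the review workflow proposed later in this RFD.
SCHEMA = """
CREATE TABLE IF NOT EXISTS reports (
    id           INTEGER PRIMARY KEY,
    policy_path  TEXT NOT NULL,
    policy_hash  TEXT NOT NULL,
    hc_hash      TEXT NOT NULL,
    repo         TEXT NOT NULL,
    commit_id    TEXT NOT NULL,
    report_file  TEXT NOT NULL,
    reviewed     INTEGER NOT NULL DEFAULT 0,
    UNIQUE (policy_path, policy_hash, hc_hash, repo, commit_id)
);
"""

def open_cache(db_path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn

def insert_report(conn: sqlite3.Connection, key: tuple, report_file: str) -> int:
    """Record a cached report; returns the auto-generated unique ID."""
    cur = conn.execute(
        "INSERT INTO reports (policy_path, policy_hash, hc_hash, repo, commit_id, report_file) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (*key, report_file),
    )
    conn.commit()
    return cur.lastrowid
```

The `UNIQUE` constraint over the five key columns enforces at the database
level that each key tuple maps to at most one cached report.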

### Report Cache Synchronization

With the implementation of RFD11, it will be more likely for multiple `hc
check` analysis instances to be adding, deleting, and/or reading cache entries
simultaneously. Due to the transaction synchronization logic it offers, using a
SQLite (or comparable) database to manage the cache would help to reduce (but
not entirely eliminate) the possibility of race conditions. The remaining
potential race condition that we have identified is the window between when an
entry is created for a report and when the associated compressed file appears
in the `report/` directory. As an alternative to this split-storage system, the
entire compressed report could be stored as a blob in the database file. [This
page][sqlite-blobs] shows when storing blobs inside the database is efficient,
as a function of configured page size and blob size. We should investigate how
large we expect a compressed report JSON file to be.
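That investigation could start with a measurement like the following sketch,
which compresses a report payload and applies a size cutoff for internal blob
storage. The threshold value here is an assumption for illustration only, not a
number taken from the SQLite documentation; the real cutoff depends on the
configured page size:

```python
import gzip
import json

# Assumed cutoff for storing a compressed report as an internal blob;
# illustrative only, to be replaced by a measured value.
BLOB_THRESHOLD_BYTES = 100 * 1024

def compressed_report_size(report: dict) -> int:
    """Size in bytes of the gzip-compressed JSON serialization of a report."""
    return len(gzip.compress(json.dumps(report).encode("utf-8")))

def store_as_blob(report: dict) -> bool:
    """Heuristic: small compressed reports go in the database as blobs;
    large ones remain separate files in report/."""
    return compressed_report_size(report) <= BLOB_THRESHOLD_BYTES
```

Storing small reports as blobs would eliminate the entry-vs-file race entirely,
since the row insert and the report write become one transaction.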

## Report Cache CLI

As with the target and plugin caches, we propose offering users a subcommand
for interfacing with the report cache, namely `hc cache report`. Users are most
likely to want to clear the cache, set a different cache eviction policy, or
invalidate all entries that match a certain subkey (e.g., invalidate all
entries for a given policy file). We propose the following subcommand
semantics:

```
// Invalidate all (request user confirmation)
hc cache report delete --all

// Shortcut for `delete --all`
hc cache report clear

// Invalidate all (don't request user confirmation)
hc cache report delete --all -y

// Invalidate all entries keyed by this hc binary.
// If no <HC_PATH> supplied, defaults to the current
// `hc` binary.
hc cache report delete --hc [<HC_PATH>]

// Invalidate all entries keyed by this policy file
hc cache report delete --policy <POLICY_PATH>

// Invalidate all entries keyed by this target, use
// target resolution system to resolve to a repo/ref
hc cache report delete --target <TARGET> [--ref <REFSPEC>]
```
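Under the SQLite design, each subkey-based `delete` variant above reduces to a
single `DELETE` statement. A sketch (Python for illustration; the column names
follow the illustrative schema assumed elsewhere in this RFD, not a final
design):

```python
import sqlite3

# Allow-list of columns that the CLI's subkey flags map onto.
ALLOWED_SUBKEYS = {"policy_path", "policy_hash", "hc_hash", "repo", "commit_id"}

def delete_by_subkey(conn: sqlite3.Connection, subkey: str, value: str) -> int:
    """Delete every cache entry matching one element of the key tuple.
    Returns the number of invalidated entries."""
    if subkey not in ALLOWED_SUBKEYS:
        raise ValueError(f"unknown subkey: {subkey}")
    # The column name is validated against the allow-list above, so string
    # interpolation is safe here; the value is still bound as a parameter.
    cur = conn.execute(f"DELETE FROM reports WHERE {subkey} = ?", (value,))
    conn.commit()
    return cur.rowcount
```

The `--target` variant would first run the value through Hipcheck's target
resolution to obtain a concrete repo path and commit before deleting.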

### Reviewing Reports Marked For Investigation

We anticipate a CI workflow where users run Hipcheck against their project
dependencies and mark the job as failing if any dependencies are marked for
investigation. We expect the user will then do their own investigation of the
marked dependency, after which they may either decide to eliminate or
substitute that dependency in their project, or determine that the dependency
is not a risk. In the latter case, the user will wish to allow the CI to
continue without failing for this dependency.

Thus, we propose an `hc cache report reviewed` command that takes the policy
file and target information as `hc check` does, but updates the `report/`
database to indicate that the failed report has been reviewed and should no
longer cause an alert. This would constitute an additional boolean column in
the SQLite database file for each entry. The benefit of tying the "reviewed"
status to a particular cache entry is that if any of the key elements change
(different `hc` version, policy file, or repo commit), the "reviewed" status
for the old key no longer applies, and users will be asked to re-review.

The proposed CLI for this command is as follows:

```
// Mark the cache entry with policy <POLICY_PATH>, target <TARGET> as reviewed.
// Optionally specify <HC_PATH>, otherwise defaults to the current `hc` binary.
hc cache report reviewed --policy <POLICY_PATH> --target <TARGET> [--ref <REFSPEC>] [--hc <HC_PATH>]

// Mark the report with cache row unique ID <UNIQUE_ID> as reviewed
hc cache report reviewed --id <UNIQUE_ID>
```

Pseudo-code for how the "reviewed" status influences the control flow is as
follows:

```
if report for target exists in cache:
    if reviewed:
        no alert
        return
    else:
        score report
        if failed investigate policy:
            mark needs investigation
else:
    generate report
    cache report
    score report
    if failed investigate policy:
        mark needs investigation
```
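Against the illustrative SQLite schema assumed earlier in this RFD, that
control flow could be realized roughly as follows (Python for illustration;
`run_analysis` and `passes_investigate_policy` are hypothetical callbacks
standing in for Hipcheck's actual analysis and scoring logic):

```python
import sqlite3

def check_with_cache(conn, key, run_analysis, passes_investigate_policy):
    """Mirror of the pseudo-code above. Returns True when the target needs
    investigation; a reviewed cache hit never raises an alert."""
    row = conn.execute(
        "SELECT id, reviewed FROM reports "
        "WHERE policy_path = ? AND policy_hash = ? AND hc_hash = ? "
        "AND repo = ? AND commit_id = ?",
        key,
    ).fetchone()
    if row is not None:
        if row[1]:
            return False  # reviewed: no alert
        # Cache hit, not reviewed: re-score the cached report.
        return not passes_investigate_policy(key)
    # Cache miss: generate and cache the report, then score it.
    run_analysis(key)
    return not passes_investigate_policy(key)
```

Note that the reviewed check happens before scoring, so a reviewed entry skips
redundant work entirely.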

### Report Cache Functionality in CI

One rather obvious deficiency in the design proposed so far is how users
running Hipcheck as a CI action will be able to mark dependencies as reviewed
so that the CI can move forward, given that they don't have easy access to a
terminal for running CLI commands. We describe a solution here, and take this
opportunity to sketch out some broader aspects of Hipcheck in CI.

We propose that Hipcheck act as a GitHub CI action that fails if any provided
targets need investigation. This means that, provided subsequent workflow steps
aren't explicitly configured to run in spite of a previous failed step, the
entire job will also fail. The failure message will report to users which
dependencies/targets need investigation, along with each dependency's unique
ID.

Once the user decides the dependency is acceptable, they will use a separate
manual CI action to manipulate the cache. This Hipcheck review action will take
as input a comma-separated list of dependencies as a string. The review action
will then split the dependency list into separate calls to `hc cache report
reviewed --id <ID>` to mark the specific dependencies as reviewed.

The key element that makes this approach work is that the main check workflow
and the manual review workflow need to pull from the same `HC_CACHE`, so that
the unique IDs match and "reviewed" markings written to the report database by
the manual review workflow are reflected in the next `hc check` run. According
to the GitHub [actions documentation][workspace-docs], it appears that the path
specified by the `GITHUB_WORKSPACE` environment variable is consistent between
runs, in which case we can point all Hipcheck actions to the same location in
this directory with the `HC_CACHE` variable.

If this strategy does not work, we propose that both actions use the
`actions/cache@v3` [action][cache-action] with a common key that saves and
restores the contents of the designated cache directory, and that `HC_CACHE` be
set to that directory when `hc` is run in both actions.

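That fallback might look roughly like the following workflow fragment. This is
a sketch under stated assumptions: the action name `mitre/hipcheck-action@v1`,
the `.hc-cache` path, and the cache key are all hypothetical placeholders, not
published actions or settled names:

```yaml
# Hypothetical workflow steps; action names and paths are illustrative only.
- name: Restore Hipcheck report cache
  uses: actions/cache@v3
  with:
    path: .hc-cache
    key: hipcheck-report-cache-${{ github.repository }}

- name: Run Hipcheck
  uses: mitre/hipcheck-action@v1   # placeholder for the proposed action
  env:
    HC_CACHE: ${{ github.workspace }}/.hc-cache
```

The review workflow would restore the same cache key before running the `hc
cache report reviewed` calls, so both workflows see one shared database.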
If users are not interested in this stricter workflow, they can mark the check
action as `continue-on-error` so that an "investigate" determination on any
number of dependencies does not cause the whole job to fail.

[sqlite-blobs]: https://www.sqlite.org/intern-v-extern-blob.html
[workspace-docs]: https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/store-information-in-variables#default-environment-variables
[cache-action]: https://github.com/marketplace/actions/cache