Commit f07e2ae

Merge pull request #19609 from henrybear327/robustness/improve_readme
Update the robustness test README
2 parents a2eedb9 + 49b4137 commit f07e2ae

1 file changed, +28 -27 lines changed
tests/robustness/README.md

Lines changed: 28 additions & 27 deletions
@@ -39,7 +39,7 @@ The purpose of these tests is to rigorously validate that etcd maintains its [KV

 ## How Robustness Tests Work

-Robustness tests compare etcd cluster behavior against a simplified model of its expected behavior.
+Robustness tests compare the etcd cluster behavior against a simplified model of its expected behavior.
 These tests cover various scenarios, including:

 * **Different etcd cluster setups:** Cluster sizes, configurations, and deployment topologies.
@@ -52,8 +52,8 @@ These tests cover various scenarios, including:
 2. **Traffic and Failures:** Client traffic is generated and sent to the cluster while failures are injected.
 3. **History Collection:** All client operations and their results are recorded.
 4. **Validation:** The collected history is validated against the etcd model and a set of validators to ensure consistency and correctness.
-5. **Report Generation:** If a failure is detected and a detailed report is generated to help diagnose the issue.
-This report includes information about the client operations, etcd data directories.
+5. **Report Generation:** If a failure is detected then a detailed report is generated to help diagnose the issue.
+This report includes information about the client operations and etcd data directories.

 ## Key Concepts

@@ -96,26 +96,25 @@ Etcd provides strict serializability for KV operations and eventual consistency
 make gofail-disable
 ```
 2. Run the tests
-
 ```bash
 make test-robustness
 ```

-Optionally you can pass environment variables:
+Optionally, you can pass environment variables:
 * `GO_TEST_FLAGS` - to pass additional arguments to `go test`.
 It is recommended to run tests multiple times with failfast enabled. this can be done by setting `GO_TEST_FLAGS='--count=100 --failfast'`.
 * `EXPECT_DEBUG=true` - to get logs from the cluster.
-* `RESULTS_DIR` - to change location where results report will be saved.
+* `RESULTS_DIR` - to change the location where the results report will be saved.
 * `PERSIST_RESULTS` - to persist the results report of the test. By default this will not be persisted in the case of a successful run.

 ## Re-evaluate existing report

 Robustness test validation is constantly changing and improving.
-Errors in etcd model could be causing false positives, which makes the ability to re-evaluate the reports after we fix the issue important.
+Errors in the etcd model could be causing false positives, which makes the ability to re-evaluate the reports after we fix the issue important.

 > Note: Robustness test report format is not stable, and it's expected that not all old reports can be re-evaluated using the newest version.

-1. Identify location of the robustness test report.
+1. Identify the location of the robustness test report.

 > Note: By default robustness test report is only generated for failed test.

@@ -124,7 +123,7 @@ Errors in etcd model could be causing false positives, which makes the ability t
 logger.go:146: 2024-04-08T09:45:27.734+0200 INFO Saving robustness test report {"path": "/tmp/TestRobustnessExploratory_Etcd_HighTraffic_ClusterOfSize1"}
 ```

-* **For remote runs on CI:** you need to go to the [Prow Dashboard](https://prow.k8s.io/job-history/gs/kubernetes-jenkins/logs/ci-etcd-robustness-amd64), go to a build, download one of the Artifacts (`artifacts/results.zip`), and extract it locally.
+* **For remote runs on CI:** you need to go to the [Prow Dashboard](https://testgrid.k8s.io/sig-etcd-robustness#Summary), go to a build, download one of the Artifacts (`artifacts/results.zip`), and extract it locally.

 ![Prow job run page](readme-images/prow_job.png)

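For local runs, the report directory can also be pulled out of the test output mechanically. The sketch below is a hypothetical helper, not something shipped in the etcd repository. It assumes only the log shape visible in the sample line of this hunk: the message `Saving robustness test report` followed by a JSON object with a `path` field. The names `MARKER` and `report_paths` are made up for illustration.

```python
import json

# Message text taken from the sample log line in the README.
MARKER = "Saving robustness test report"


def report_paths(log_lines):
    """Return the report directories named in matching log lines."""
    paths = []
    for line in log_lines:
        _, found, rest = line.partition(MARKER)
        if not found:
            continue  # not a report log line
        start = rest.find("{")
        if start == -1:
            continue  # marker present but no JSON payload
        try:
            fields = json.loads(rest[start:])
        except json.JSONDecodeError:
            continue  # payload was truncated or malformed
        if isinstance(fields, dict) and "path" in fields:
            paths.append(fields["path"])
    return paths


sample = (
    "logger.go:146: 2024-04-08T09:45:27.734+0200 INFO "
    "Saving robustness test report "
    '{"path": "/tmp/TestRobustnessExploratory_Etcd_HighTraffic_ClusterOfSize1"}'
)
print(report_paths([sample]))
```

Lines that do not carry the marker, or whose payload does not parse, are skipped rather than treated as errors, since test output usually mixes many unrelated log lines.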
@@ -144,14 +143,14 @@ Errors in etcd model could be causing false positives, which makes the ability t

 The `testdata` directory can contain multiple robustness test reports.
 The name of the report directory doesn't matter, as long as it's unique to prevent clashing with reports already present in `testdata` directory.
-For example path for `history.html` file could look like `$REPO_ROOT/tests/robustness/testdata/v3.5_failure_24_April/history.html`.
+For example, the path for `history.html` file could look like `$REPO_ROOT/tests/robustness/testdata/v3.5_failure_24_April/history.html`.

 3. Run `make test-robustness-reports` to validate all reports in the `testdata` directory.

 ## Analysing failure

-If robustness tests fails we want to analyse the report to confirm if the issue is on etcd side. Location of the directory with the report
-is mentioned `Saving robustness test report` log. Logs from report generation should look like:
+If robustness tests fail, we want to analyse the report to confirm if the issue is on etcd side. The location of the directory with the report
+is mentioned in the `Saving robustness test report` log. Logs from report generation should look like:
 ```
 logger.go:146: 2024-05-08T10:42:54.429+0200 INFO Saving robustness test report {"path": "/tmp/TestRobustnessRegression_Issue14370/1715157774429416550"}
 logger.go:146: 2024-05-08T10:42:54.429+0200 INFO Saving member data dir {"member": "TestRobustnessRegressionIssue14370-test-0", "path": "/tmp/TestRobustnessRegression_Issue14370/1715157774429416550/server-TestRobustnessRegressionIssue14370-test-0"}
@@ -178,21 +177,21 @@ is mentioned `Saving robustness test report` log. Logs from report generation sh
 logger.go:146: 2024-05-08T10:42:54.441+0200 INFO Saving visualization {"path": "/tmp/TestRobustnessRegression_Issue14370/1715157774429416550/history.html"}
 ```

-Report follows the hierarchy:
+The report follows the hierarchy:
 * `server-*` - etcd server data directories, can be used to verify disk/memory corruption.
   * `member`
     * `wal` - Write Ahead Log (WAL) directory, that can be analysed using `etcd-dump-logs` command line tool available in `tools` directory.
     * `snap` - Snapshot directory, includes the bbolt database file `db`, that can be analysed using `etcd-dump-db` command line tool available in `tools` directory.
 * `client-*` - Client request and response dumps in json format.
-  * `watch.jon` - Watch requests and responses, can be used to validate [watch API guarantees].
+  * `watch.json` - Watch requests and responses, can be used to validate [watch API guarantees].
   * `operations.json` - KV operation history
 * `history.html` - Visualization of KV operation history, can be used to validate [KV API guarantees].

-### Example analysis of linearization issue
+### Example analysis of a linearization issue

 Let's reproduce and analyse robustness test report for issue [#14370].
 To reproduce the issue by yourself run `make test-robustness-issue14370`.
-After a couple of tries robustness tests should fail with a log `Linearization failed` and save report locally.
+After a couple of tries robustness tests should fail with a log `Linearization failed` and save the report locally.

 Example:
 ```
@@ -211,14 +210,14 @@ Jump to the error in linearization by clicking `[ jump to first error ]` on the
 You should see a graph similar to the one on the image below.
 ![issue14370](readme-images/issue14370.png)

-Last correct request (connected with grey line) is a `Put` request that succeeded and got revision `168`.
+The last correct request (connected with the grey line) is a `Put` request that succeeded and got revision `168`.
 All following requests are invalid (connected with red line) as they have revision `167`.
-Etcd guarantee that revision is non-decreasing, so this shows a bug in etcd as there is no way revision should decrease.
-This is consistent with the root cause of [#14370] as it was issue with process crash causing last write to be lost.
+Etcd guarantees that revision is non-decreasing, so this shows a bug in etcd as there is no way revision should decrease.
+This is consistent with the root cause of [#14370] as it was an issue with the process crash causing the last write to be lost.

 [#14370]: https://github.com/etcd-io/etcd/issues/14370

-### Example analysis of watch issue
+### Example analysis of a watch issue

 Let's reproduce and analyse robustness test report for issue [#15271].
 To reproduce the issue by yourself run `make test-robustness-issue15271`.
@@ -236,22 +235,24 @@ Example:
 ```

 Watch issues are easiest to analyse by reading the recorded watch history.
-Watch history is recorded for each client separated in different subdirectory under `/tmp/TestRobustnessRegression_Issue15271/1715158215866033806`
-Open `watch.json` for client mentioned in log `Broke watch guarantee`.
+
+Watch history is recorded for each client separated in different subdirectory under `/tmp/TestRobustnessRegression_Issue15271/1715158215866033806`.
+
+Open `watch.json` for the client mentioned in the log `Broke watch guarantee`.
 For client `4` that broke the watch guarantee open `/tmp/TestRobustnessRegression_Issue15271/1715158215866033806/client-4/watch.json`.

-Each line consists of json blob corresponding to single watch request sent by client.
-Look for events with `Revision` equal to revision mentioned in the first log with `Broke watch guarantee`, in this case look for `"Revision":3,`.
+Each line consists of json blob corresponding to a single watch request sent by the client.
+Look for events with `Revision` equal to revision mentioned in the first log with `Broke watch guarantee`, in this case, look for `"Revision":3,`.
 You should see watch responses where the `Revision` decreases like ones below:
 ```
 {"Events":[{"Type":"put-operation","Key":"key5","Value":{"Value":"793","Hash":0},"Revision":799,"IsCreate":false,"PrevValue":null}],"IsProgressNotify":false,"Revision":799,"Time":3202907249,"Error":""}
 {"Events":[{"Type":"put-operation","Key":"key4","Value":{"Value":"1","Hash":0},"Revision":3,"IsCreate":true,"PrevValue":null}, ...
 ```

-Up to the first response the `Revision` of events only increased up to a value of `799`.
+Up to the first response, the `Revision` of events only increased up to a value of `799`.
 However, the following line includes an event with `Revision` equal `3`.
-If you follow the `revision` throughout the file you should notice that watch replayed revisions second time.
+If you follow the `revision` throughout the file you should notice that watch replayed revisions for a second time.
 This is incorrect and breaks `Ordered` [watch API guarantees].
-This is consistent with the root cause of [#14370] where member reconnecting to cluster will resend revisions.
+This is consistent with the root cause of [#14370] where the member reconnecting to cluster will resend revisions.

 [#15271]: https://github.com/etcd-io/etcd/issues/15271
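
The manual search for a decreasing `Revision` described in the watch analysis can be approximated with a small script. This is a minimal sketch, not a tool from the etcd repository. It assumes that every non-empty line of `watch.json` is a JSON object with an `Events` list whose entries carry an integer `Revision`, as in the excerpt in the hunk above, and it treats the whole file as a single stream. The function name `find_revision_decreases` and the trimmed-down sample lines are illustrative only.

```python
import json


def find_revision_decreases(lines):
    """Yield (line_number, highest_seen, revision) for each event whose
    Revision is lower than the highest event Revision seen so far."""
    highest = 0
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip empty lines
        response = json.loads(line)
        # "Events" may be missing or null in some responses (assumption).
        for event in response.get("Events") or []:
            revision = event["Revision"]
            if revision < highest:
                yield number, highest, revision
            else:
                highest = revision


# Shortened lines modelled on the README excerpt; real lines have more fields.
sample = [
    '{"Events":[{"Type":"put-operation","Key":"key5","Revision":799}]}',
    '{"Events":[{"Type":"put-operation","Key":"key4","Revision":3}]}',
]
for number, highest, revision in find_revision_decreases(sample):
    print(f"line {number}: revision {revision} after {highest}")
# prints "line 2: revision 3 after 799"
```

Events with the same revision as the previous one are not flagged, because the check only looks for a strict decrease. For a real report, the lines would come from a file such as `client-4/watch.json` opened with `open(path)`.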