Commit 2e5ef6f

Add instructions on how to use the redaction tool and update the language support in README (#1023)
* update overview by adding a few examples
* update overview and add labels.json
* remove __pycache__
* update .gitignore
* add redaction tool instruction
* update based on Chia-Sheng's review comments
1 parent 768b3cc commit 2e5ef6f

8 files changed (+105, -0 lines changed)
scripts/redact_cli_py/README.md

Lines changed: 3 additions & 0 deletions
@@ -10,6 +10,9 @@ The OCR.json and labels.json will also be redacted while keeping the semantics o
1010
![ocr-before-after-redaction](./images/ocr-before-after-redaction.png)
1111
![labels-before-after-redaction](./images/labels-before-after-redaction.png)
1212

13+
## Language support
14+
This tool supports redaction of Latin characters only. For support for non-Latin documents, please [contact us](mailto:formrecog_contact@microsoft.com?subject=Redaction%20tool%20language%20support).
15+
1316
## Version
1417
Redact CLI 0.2.3
1518


scripts/redact_cli_py/instruction.md

Lines changed: 102 additions & 0 deletions
@@ -0,0 +1,102 @@
1+
# How to use the Redaction Tool?
2+
3+
## Overview
4+
The Redaction Tool is provided by the Azure Form Recognizer team to help customers who are willing to share or donate data with Microsoft do so efficiently and without privacy concerns. It focuses on labeling and redacting Personally Identifiable Information (PII) or other sensitive data while keeping the semantics of these fields (e.g. length of the values, character/digit patterns, upper/lower case) for prebuilt model training. This step-by-step guide shows you how to use the tool.
5+
6+
![process-overview](./images/redaction-tool.png)
7+
8+
## Language support
9+
This tool supports redaction of Latin characters only. For support for non-Latin documents, please [contact us](mailto:formrecog_contact@microsoft.com?subject=Redaction%20tool%20language%20support).
10+
11+
## Prerequisites
12+
- Azure subscription - [Create one for free](https://azure.microsoft.com/free/cognitive-services)
13+
- Once you have your Azure subscription, create a [Form Recognizer resource](https://ms.portal.azure.com/#create/Microsoft.CognitiveServicesFormRecognizer) in the Azure portal to get your key and endpoint.
14+
- Create an Azure storage account with a container - following the [Azure Storage quickstart for Azure portal](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-portal). Use the *standard* performance tier.
15+
- [Optional] If the data cannot leave your environment, you need to follow [this document](https://docs.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/label-tool#set-up-the-sample-labeling-tool) to set up the labeling tool container.
16+
17+
## Step 1 - label the PII/sensitive fields with Sample Labeling Tool (FOTT)
18+
* Assemble your raw images based on the input requirements below and [upload them into a blob storage container](https://docs.microsoft.com/en-us/azure/cognitive-services/form-recognizer/build-training-data-set#upload-your-training-data). Single-page PDF is supported for batch redaction only. Before labeling, you need to convert multi-page PDF files to images in one of the formats below.
19+
- Format must be JPG, JPEG, PNG, BMP or TIFF.
20+
- File size must be less than 50 MB.
21+
- Image dimensions must be between 50 x 50 pixels and 10000 x 10000 pixels.
22+
* Generate a shared access signature (SAS) token and URL for FoTT to access the blob storage. In “Permissions”, make sure you select all the options, including **Read**, **Add**, **Create**, **Write**, **Delete**, **List**, **Immutable storage**.
23+
![generate-sas-token](./images/SAS-token.png)
24+
* Go to [FoTT portal](https://fott-2-1.azurewebsites.net/) and create a new data connection by clicking on the left side navigation icon.
25+
![fott-connection](./images/fott-connection.png)
26+
* Go to the homepage of FoTT portal and create a new project by clicking “Use Custom to train a model with labels and get key value pairs”.
27+
![fott-project](./images/fott-project.png)
28+
* Once your project is created and opened, select “Run layout on all documents” first to get the OCR results, which are shown as yellow bounding boxes. This will make tagging easier in later steps.
29+
![fott-run-layout](./images/fott-run-layout.png)
30+
* Only label the PII or sensitive fields (refer to [Appendix](#appendix-pii-fields-reference)) by following [this guidance](https://docs.microsoft.com/azure/applied-ai-services/form-recognizer/label-tool?tabs=v2-1#label-your-forms). Use the OCR-detected results (the yellow bounding boxes) as much as possible. You can multi-select these bounding boxes to associate them with a tag. If the value is not detected by OCR, or you want to redact a picture such as a profile image, you can *draw a region* to associate the area with the tag.
31+
32+
---
33+
**NOTE**
34+
35+
Label at as fine a granularity as possible, e.g. use separate tags for first name and last name, and for city, zip code and building address, so that the redacted results keep as much of the semantics as possible.
36+
37+
---
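
The input requirements listed at the start of this step can be checked locally before uploading. The snippet below is a minimal sketch, not part of the tool: `check_image` is a hypothetical helper, it assumes that "50 MB" means 50 × 1024 × 1024 bytes, and it accepts `.tif` as an alternative spelling of the TIFF extension. The caller is expected to supply the file size and pixel dimensions (for example from an image library).

```python
from pathlib import Path

# Extensions for the formats listed above; ".tif" is an assumed alias for TIFF.
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff"}
MAX_SIZE_BYTES = 50 * 1024 * 1024  # assumption: "50 MB" read as 50 MiB
MIN_SIDE, MAX_SIDE = 50, 10000     # allowed range for width and height, in pixels


def check_image(name, size_bytes, width, height):
    """Return a list of problems; an empty list means the image passes."""
    problems = []
    if Path(name).suffix.lower() not in ALLOWED_EXTENSIONS:
        problems.append("unsupported format")
    if size_bytes >= MAX_SIZE_BYTES:
        problems.append("file too large")
    if not (MIN_SIDE <= width <= MAX_SIDE and MIN_SIDE <= height <= MAX_SIDE):
        problems.append("dimensions out of range")
    return problems
```

For example, `check_image("receipt.png", 120_000, 1200, 1600)` returns an empty list, while a PDF file name is reported as an unsupported format.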
38+
39+
* Once you are done with the labeling, go to your blob storage container to check the images together with OCR and labeling results (*.ocr.json, *.labels.json).
40+
![labeling-results](./images/labeling-results.png)
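
If the container holds many files, the check above can be scripted. The snippet below is a minimal sketch under two assumptions: that you already have the list of blob names (for example from a storage listing), and that the result files are named by appending `.ocr.json` and `.labels.json` to the full image file name. `missing_companions` is a hypothetical helper, not part of the tool.

```python
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff")
COMPANION_SUFFIXES = (".ocr.json", ".labels.json")


def missing_companions(blob_names):
    """Map each image name to the result-file suffixes that are absent."""
    names = set(blob_names)
    missing = {}
    for name in sorted(names):
        if not name.lower().endswith(IMAGE_EXTENSIONS):
            continue  # skip the JSON result files themselves
        absent = [s for s in COMPANION_SUFFIXES if name + s not in names]
        if absent:
            missing[name] = absent
    return missing
```

An image that appears in the result may simply not have been labeled yet, so treat the output as a list of files to review rather than as errors.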
41+
42+
## Step 2 - run redaction Python scripts
43+
The redaction Python script is open source. You can follow the [instructions](README.md) to set up the environment, then choose one of the options to redact your data:
44+
- Single image redaction - you need to download the image from blob storage to your local machine, then run `redact.py`.
45+
- Batch redaction - we support batch redaction of images, labels.json and ocr.json from a local folder or an Azure blob storage virtual folder.
46+
47+
## Step 3 - check redacted files and share with Microsoft
48+
Once redaction is completed, you can review and confirm if it's OK to share with Microsoft. Below is an example of an image before and after redaction.
49+
![image-before-after-redaction](./images/DL-before-after-redaction.png)
50+
51+
The OCR.json and labels.json will also be redacted while keeping the semantics of the text (e.g. length, upper/lower case, character/digit patterns).
52+
![ocr-before-after-redaction](./images/ocr-before-after-redaction.png)
53+
![labels-before-after-redaction](./images/labels-before-after-redaction.png)
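
To illustrate what "keeping the semantics" of a text value can mean, the snippet below replaces each digit with a random digit and each letter with a random letter of the same case, leaving punctuation and spaces untouched. This is only a sketch of the idea, and it is not necessarily how the tool itself redacts text; `redact_text` is a hypothetical name, and only ASCII Latin letters and digits are handled.

```python
import random
import string


def redact_text(text, rng=None):
    """Replace characters while preserving length, case and digit positions."""
    rng = rng or random.Random()
    out = []
    for ch in text:
        if ch in string.digits:
            out.append(rng.choice(string.digits))
        elif ch in string.ascii_uppercase:
            out.append(rng.choice(string.ascii_uppercase))
        elif ch in string.ascii_lowercase:
            out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)  # keep spaces, punctuation and other characters
    return "".join(out)
```

Calling `redact_text("Jane Doe, ID 12-345")` yields a string of the same length with the same pattern of upper-case letters, lower-case letters, digits and punctuation, but different content.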
54+
55+
Please consolidate all the redacted images, redacted OCR results and redacted labels and [send them to us](mailto:formrecog_contact@microsoft.com?subject=Redacted%20data%20sharing). You can upload them to shared storage if needed.
56+
57+
## Contact us
58+
For any questions or feedback you have regarding this tool, please email us: formrecog_contact@microsoft.com.
59+
60+
## Appendix PII fields reference
61+
### Customer Name
62+
Names of people.
63+
For example, each of these would be marked as “Person Name.”
64+
- Abdul-Qahhar Abadi
65+
- Wesley Brooks
66+
- Jiang Li Liu
67+
- Dragoslava Simovic
68+
- Robert Downey Jr.
69+
Names should also be marked if they are *partial names* or *signatures*. For example, the name “Camille Cartier” may also appear as “C. Cartier” or “Cartier, Camille”. Each of these should be marked as a Person Name.
70+
#### Exception: Person Names In Context
71+
When a name includes a person name but refers to an entity other than a person, it should *NOT* be marked as “Person Name”.
72+
- Places named after people: “Lincoln, Nebraska”, “Martin Luther King Jr Boulevard”, or “Camille Carter Memorial Stadium.”
73+
- Companies named after people: “Grace Owens Plumbing Co, LLC”, “Gabriel Woods Consulting, Inc”, or “Huang, Carlsen, & Amaya, Inc”.
74+
#### Exception: Title
75+
Please *DO NOT* include titles that are not part of the person's name, e.g. Mr., Ms., Dr.
76+
77+
### Personal Email
78+
Label the full email address which includes domain name and email account.
79+
80+
### Unredacted Credit Card Number
81+
Credit/debit card number. If more than 4 trailing digits are shown, we treat the number as unredacted. *We only label unredacted credit card numbers.*
82+
- XXXXXXXXXX53140 is an unredacted form.
83+
- A12XXXXXXXX53140 is an unredacted form.
84+
- XXXXXXXXXXX3140 is *NOT* an unredacted form.
85+
- 123XXXXXXXXX3140 is *NOT* an unredacted form.
86+
- 123XXXXXXXXX3140/XXXX is *NOT* an unredacted form.
87+
- AMEX-*3140 is *NOT* an unredacted form.
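
The rule above can be sketched in code. The function below is a hypothetical helper, not part of the tool: it interprets "trailing digits" as the last run of consecutive digits in the value, after removing spaces and hyphens (an assumption made so that grouped numbers such as `1234-5678-...` are handled), and reports the value as unredacted when that run is longer than 4 digits.

```python
import re


def is_unredacted_card_number(text):
    """Return True if the last run of digits has more than 4 digits."""
    compact = re.sub(r"[\s-]", "", text)      # drop spaces and hyphens
    digit_runs = re.findall(r"\d+", compact)  # consecutive digit groups
    return bool(digit_runs) and len(digit_runs[-1]) > 4
```

With this interpretation, `XXXXXXXXXX53140` and `A12XXXXXXXX53140` are reported as unredacted, while `XXXXXXXXXXX3140`, `123XXXXXXXXX3140`, `123XXXXXXXXX3140/XXXX` and `AMEX-*3140` are not, which matches the examples listed above.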
88+
89+
### Customer Address
90+
Mailing address of the customer. Please label the full address, including street, city, country and postal code.
91+
92+
### Customer License Number
93+
Driver’s license number. It usually shows up on parking receipts.
94+
95+
### Customer Membership Number
96+
Any kind of membership number that could identify the customer should be included.
97+
98+
### Face Photo
99+
If the customer’s face photo appears on the document, please draw a region to cover the whole face.
100+
101+
### Other
102+
Please also tag any other customer-related PII that is not listed in the fields above.
