This tool is a prototype that scans a set of given pages and outputs all the tracking technology instances found. At the moment only Cookies are supported, but more types will be supported in the future.
This tool is especially useful for companies that are required by regulators to establish a baseline of their existing Cookies but do not know how to get started.
Under the hood, this tool roughly reflects how website scanning tools such as OneTrust or ObservePoint work.
By default, the tool launches 10 headless browsers in parallel to do the scanning. Check the last part of this doc for more fine-tuning options.
Normal usage takes 3 steps:
# this will scan 300 urls in the www.google.com domain
DOMAIN=google yarn start
# check all cookies found in this job
cat output/cookie-names.txt
# get cookie information: the pages on which this cookie is dropped and the initiators will be output
COOKIE=_ga_A12345 yarn get-cookie
$ COOKIE=BAIDUID yarn get-cookie
Domain:.baidu.com ,print at most 100 items
=================================================================
Found in:
https://www.csdn.net//marketing.csdn.net/questions/Q2202181748074189855
its initiator is https://gsp0.baidu.com/yrwHcjSl0MgCo2Kml5_Y_D3/...
this cookie is set via set-cookie
TODO: Add diagram to replace textual description
Before reading on, be sure to check the configs under the `config/domains` folder; they are self-explanatory.
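For orientation, a domain config might look roughly like the sketch below. Only the field names `entryList`, `allowUrlAsTask`, and `limitsByRegex` come from this doc; the exact shapes (the callback signature and the regex/limit pairs in particular) are assumptions, so treat the real files under `config/domains` as the source of truth.

```ts
// Illustrative only: field shapes are assumptions, not the repo's real schema.
const config = {
  // Pages the crawler starts from; they seed the task queue.
  entryList: [
    'https://www.google.com/',
    'https://www.google.com/search?q=example',
  ],

  // Optional: override the default "skip urls outside this domain" rule.
  // Assumed here to be a predicate over the candidate url (hypothetical signature).
  allowUrlAsTask: (url: string): boolean => url.includes('google.com'),

  // Optional: cap how many times urls matching a pattern may be crawled,
  // to avoid re-scanning near-identical pages (hypothetical shape).
  limitsByRegex: [{ regex: /\/search\?/, limit: 5 }],
};

export default config;
```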
- You pass the domain as `DOMAIN=google`, and the tool searches for the matching config file under the `config/domains` folder. The tool adds `config.entryList` to its crawl queue and sets `counter = 0`.
- (Temporarily disabled) Load the data from previous runs so that incremental scanning can be realized.
- Enter the while loop: pick up a task while the queue is not empty and `counter` is less than `TASKS` (a simplified sketch of this loop follows the list).
- If `config.allowUrlAsTask` is not defined, the default behavior is that urls outside of this domain are skipped; this rule does not apply to iframe tasks. You may always customize this behavior.
- Check whether the task needs to be run: if the task matches a regex defined in `config.limitsByRegex`, that regex is only allowed to run up to N times, and once N is reached the task is skipped. If the task does not match any regex, it is normalized by dropping all of its trailing parameters, and the normalized instance can only run up to 10 times. These two mechanisms prevent the crawler from endlessly crawling similar or identical content, which would be a big waste of machine resources. This is similar to how ObservePoint handles deduplication; check `skipThisUrl` in `task.ts` for more information.
- Use `node-fetch` to fetch the original page HTML response.
- Patch the result by inserting the `document.cookie=` hijacking logic defined in `document-cookie-interceptor.js`.
- Render the html again using the modified html body.
- Get the console output for that web page, search for content related to `document.cookie=`, and write the result to memory.
- Search for all `set-cookie` headers of every resource the page loads (.js, .css, xhr calls, etc.) and write the result to memory.
- Search for the `set-cookie` header of the html page itself and write the result to memory.
- Find all `a` links and `iframe` srcs and add them to the task queue.
- `counter++`
- If no task is left or `TASKS` tasks have been executed, write the scanned data to the `output` folder.
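To make the loop above concrete, here is a simplified TypeScript sketch of it. This is not the repo's actual implementation (see `task.ts` for that): the function names are made up for illustration, and the browser pool, `config.limitsByRegex` handling, `allowUrlAsTask`, and incremental loading are omitted.

```ts
import { readFileSync } from 'fs';
import fetch from 'node-fetch';
import puppeteer from 'puppeteer';

// Illustrative sketch of the crawl loop; crawlDomain/normalizeUrl are made-up names.
async function crawlDomain(config: { entryList: string[] }, TASKS = 300): Promise<void> {
  const queue: string[] = [...config.entryList];
  const runCounts = new Map<string, number>(); // normalized url -> times crawled
  let counter = 0;

  const interceptor = readFileSync('config/document-cookie-interceptor.js', 'utf8');
  const browser = await puppeteer.launch({ headless: true });

  while (queue.length > 0 && counter < TASKS) {
    const url = queue.shift()!;

    // Deduplication (cf. skipThisUrl in task.ts): drop trailing parameters and
    // allow each normalized url to run at most 10 times.
    const normalized = normalizeUrl(url);
    const runs = runCounts.get(normalized) ?? 0;
    if (runs >= 10) continue;
    runCounts.set(normalized, runs + 1);

    // 1. Fetch the original HTML and insert the document.cookie hijacking logic.
    const html = await (await fetch(url)).text();
    const patched = html.replace('<head>', `<head><script>${interceptor}</script>`);

    // 2. Render the patched HTML and listen for cookie evidence.
    const page = await browser.newPage();
    page.on('console', (msg) => {
      if (msg.text().includes('document.cookie=')) {
        // record a cookie set from JavaScript
      }
    });
    page.on('response', (res) => {
      const setCookie = res.headers()['set-cookie'];
      if (setCookie) {
        // record a cookie set via a set-cookie header (page or sub-resource)
      }
    });
    await page.setContent(patched, { waitUntil: 'networkidle0' });

    // 3. Discover new tasks: <a> hrefs and <iframe> srcs.
    const nextUrls = await page.$$eval('a[href], iframe[src]', (els) =>
      els.map((el) => (el as HTMLAnchorElement).href || (el as HTMLIFrameElement).src)
    );
    queue.push(...nextUrls);

    await page.close();
    counter++;
  }

  await browser.close();
  // Finally, write the collected data to the output/ folder.
}

function normalizeUrl(url: string): string {
  return url.split('?')[0];
}
```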
config/master-sheet.txt
Contains all 3rd-party cookies whose owners need to be detected.
config/document-cookie-interceptor.js
The js file used for the document.cookie= interception task; you may ignore it (a sketch of the technique follows below).
config/deprecated.txt
Not actively used now; you may ignore it.
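The interceptor file itself is not reproduced here, but the usual technique behind this kind of `document.cookie=` hijacking is to redefine the `cookie` property so that every write is echoed to the console (where the crawler can pick it up) before being forwarded to the native setter. A minimal sketch, assuming that approach (the real `config/document-cookie-interceptor.js` may differ):

```ts
// Minimal sketch of a document.cookie interceptor (not the repo's actual file).
// It wraps the native cookie accessor so every write is echoed to the console,
// which the crawler then scrapes from the page's console output.
(() => {
  const native = Object.getOwnPropertyDescriptor(Document.prototype, 'cookie');
  if (!native || !native.get || !native.set) return;

  Object.defineProperty(document, 'cookie', {
    configurable: true,
    get() {
      return native.get!.call(document);
    },
    set(value: string) {
      // The crawler searches console output for this marker.
      console.log('document.cookie=', value);
      native.set!.call(document, value);
    },
  });
})();
```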
output/cookie-info.json
This is the source of truth of all scanning results (a hypothetical record shape is sketched below).
output/cookie-names.txt
All cookie+domain pairs found; this is derived from cookie-info.json for easier reading.
output/crawled-urls.txt
All urls that have been crawled.
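The exact schema of `output/cookie-info.json` is defined by the code, but judging from what `yarn get-cookie` prints (cookie name, domain, the page it was found on, the initiator, and whether it came via `set-cookie` or `document.cookie=`), each record carries roughly the following information. The field names below are hypothetical:

```ts
// Hypothetical shape of a cookie-info.json record, inferred from the
// get-cookie output shown earlier; the real field names may differ.
interface CookieTrace {
  name: string;          // e.g. "BAIDUID"
  domain: string;        // e.g. ".baidu.com"
  foundIn: string;       // page url on which the cookie was dropped
  initiator: string;     // resource that triggered the cookie
  setVia: 'set-cookie' | 'document.cookie'; // how the cookie was set
}
```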
- Install the node version manager NVM
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.1/install.sh | bash
- Put this into your ~/.zshrc to make the nvm command accessible everywhere
export NVM_DIR="$([ -z "${XDG_CONFIG_HOME-}" ] && printf %s "${HOME}/.nvm" || printf %s "${XDG_CONFIG_HOME}/nvm")"
[ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh" # This loads nvm
- Restart your terminal and install node 22 and yarn
nvm install v22
npm install -g yarn
Note that this step needs to be done every time you do a git pull
yarn
# compile TypeScript to JavaScript; the final product is in the dist/ folder
yarn compile
yarn detect
The current repo already contains the results of a previous run (which might not be complete),
and you may run COOKIE=$SOME_COOKIE_NAME yarn get-cookie
to get information about a cookie.
Caveat: always redirect the output to an external file for inspection, as VS Code always truncates console output.
# by default, output at most 100 traces per domain
npx cross-env COOKIE=$SOME_COOKIE_NAME yarn get-cookie > a.txt
# but you can customize the maximum
npx cross-env MAX_LINE_COUNT=150 COOKIE=$SOME_COOKIE_NAME yarn get-cookie > a.txt
# you can also output everything and not copy to the pasteboard
npx cross-env ALL=true COOKIE=$SOME_COOKIE_NAME yarn get-cookie > a.txt
This is for normal scanning, which initializes its starting pages from config.entryList:
# search for all set-cookie instances
# TASKS defaults to 300 if not defined
npx cross-env TASKS=100 yarn start
This only crawls the urls in config.entryList and then exits:
npx cross-env STOP_IMMEDIATE_PROPAGATION=true DOMAIN=https://www.example.com yarn start
This only crawls the urls in config.entryList plus the iframe and a links on these pages, and then exits:
npx cross-env STOP_PROPAGATION=true DOMAIN=google yarn start
This is for debugging:
# make sure POOL=1, otherwise you will get entangled results
npx cross-env POOL=1 yarn debug
# it is helpful to turn on verbose mode + devtools in debug mode
npx cross-env POOL=1 VERBOSE=true DEV_TOOL=true yarn debug
Then open chrome://inspect
in Chrome, and you will see something like the following.
In case you don't see your target, enable "Discover network targets" and update its configuration.
Prettier is not set up for the time being, and neither is the Husky pre-commit hook.
yarn lint
yarn lint --fix
- The TASKS count does not include the tasks in config.entryList
- For Mac users: the code tries to close each tab after each task and destroys all browser instances after a run or an interruption, but in case the headless browsers are not released, search for "chrome for testing" in the Mac Activity Monitor and kill all the browsers. This problem does not seem to exist on Windows, as tested on another computer of mine.
- Define the task count -> the default is 300 tasks, but you may change it with TASKS=1000
- The default pool size is 10, but you may set it with POOL=20; normally one browser takes up about 1.2G of memory, so if your machine can afford at least 20 * 1.2 * 1.5 (buffer for the system) = 36G of memory, 20 instances should not be a problem
- Verbose output -> set the env variable VERBOSE=true
- Change the headless browser count -> set the env variable POOL=$some_number
- Show devtools -> set the env variables DEV_TOOL=true POOL=1
- Some domains require interacting with a blocking page before the crawler can start crawling, for example a bot-detection page, or the PIPL consent wall shown when visiting www.booking.com from China. In the future, allowing users to write custom logic to manipulate puppeteer should be supported, so that these blockers can be worked around via a few mimicked user interactions (for example, clicking a checkbox and then clicking OK)
- At the moment only finding Cookies is supported, but finding all other tracking technologies such as localStorage/sessionStorage/iframes etc. should also be easy
- User journeys: some cookies are only dropped under specific conditions, which most of the time require the user to be logged in or to go through some user journey. Still, on each page, the capture of Cookies works the same way, via intercepting "document.cookie=" and observing "set-cookie" headers
- A more user-friendly UI tool is being planned