
Cookie Crawler

This tool is a prototype that scans a given set of pages and outputs all the tracking technology instances found. At the moment only Cookies are detected, but more types will be supported in the future.

This tool is especially useful for companies that are required by regulators to establish a baseline of their existing Cookies but do not know how to get started.

Under the hood, this tool roughly mirrors how website scanning tools such as OneTrust or ObservePoint work.

By default, the tool launches 10 headless browsers in parallel to do the scanning. Check the last part of this doc for more fine-tuning options.

Normal usage, in 3 steps

# this will scan 300 urls in the www.google.com domain
DOMAIN=google yarn start 

# check all cookies found in this job
cat output/cookie-names.txt

# get cookie information: the pages where this cookie is dropped and the initiators that set it
COOKIE=_ga_A12345 yarn get-cookie

An example output

$ COOKIE=BAIDUID yarn get-cookie

Domain:.baidu.com ,print at most 100 items
=================================================================
Found in:
https://www.csdn.net//marketing.csdn.net/questions/Q2202181748074189855
its initiator is https://gsp0.baidu.com/yrwHcjSl0MgCo2Kml5_Y_D3/...
this cookie is set via set-cookie

How it works

TODO: Add diagram to replace textual description

Before reading on, be sure to check the configs under the config/domains folder, which are self-explanatory.
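
For illustration, a domain config might look roughly like the following. This is a hypothetical TypeScript sketch; the field names entryList, allowUrlAsTask and limitsByRegex come from the description below, but the exact shape and file format in config/domains may differ.

// hypothetical sketch of a domain config, not the actual file format
export const config = {
  // pages the crawler starts from
  entryList: [
    'https://www.google.com',
    'https://www.google.com/search?q=example',
  ],
  // optional: decide whether an arbitrary url becomes a crawl task
  allowUrlAsTask: (url: string) => url.includes('google.com'),
  // cap how many times urls matching a regex may be crawled
  limitsByRegex: [{ regex: /\/search\?/, limit: 5 }],
};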

  • You need to pass the domain via DOMAIN=google, and the tool will search for the config file under the config/domains folder. The tool adds config.entryList to its crawl queue and sets counter = 0

  • This step is temporarily disabled: load previously scanned data so that incremental scanning can be realized

  • Enter the while loop: pick up a task as long as the queue is not empty and the counter is less than TASKS.

  • If config.allowUrlAsTask is not defined, the default behavior is that urls outside of this domain are skipped; this rule does not apply to iframe tasks. You may always customize this behavior

  • Check whether the task needs to run: if the task matches a regex defined in config.limitsByRegex, that regex can only run up to N times, and once N is reached the task is skipped. If the task does not match any regex, its url is normalized by dropping all trailing parameters, and each normalized url can run at most 10 times.

These two rules prevent the crawler from endlessly crawling similar or identical content, which would be a big waste of machine resources. This is similar to how ObservePoint handles deduplication; check skipThisUrl in task.ts for more information.

  • Use node-fetch to fetch the original page's HTML response

  • Patch the result by inserting the document.cookie= hijacking logic defined in document-cookie-interceptor.js

  • Render the page again using the modified HTML body

  • Get the console output for that web page, search for entries related to document.cookie=, and write the result to memory

  • Search for set-cookie headers on each resource the page loads (.js, .css, XHR calls, etc.) and write the result to memory

  • Search for the set-cookie header of the HTML page itself and write the result to memory

  • Find all a links and iframe srcs

  • Add them to the task queue

  • counter++

  • If no task is left or TASKS tasks have been executed, write the scanned data to the output folder
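
Putting the steps above together, the crawl loop can be sketched roughly like this. This is simplified TypeScript: skipThisUrl is the function in task.ts mentioned above, while the other helper names and types are hypothetical placeholders, not the actual implementation.

// simplified sketch of the crawl loop described above, not the real code
async function crawl(config: DomainConfig, maxTasks: number): Promise<void> {
  const queue: Task[] = config.entryList.map(url => ({ url, isIframe: false }));
  let counter = 0;

  while (queue.length > 0 && counter < maxTasks) {
    const task = queue.shift()!;
    if (skipThisUrl(task, config)) continue;         // regex limits + url normalization

    const html = await fetchHtml(task.url);          // node-fetch
    const patched = injectCookieInterceptor(html);   // insert document-cookie-interceptor.js
    const page = await renderInHeadlessBrowser(patched);

    recordDocumentCookieWrites(page);                // console entries from the interceptor
    recordSetCookieHeaders(page);                    // set-cookie headers of the page and its resources
    queue.push(...findLinksAndIframes(page));        // a hrefs and iframe srcs become new tasks

    counter++;
  }
  await writeOutput();                               // output/cookie-info.json and friends
}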

Config folder

config/master-sheet.txt contains all 3rd-party cookies whose owners need to be detected.

config/document-cookie-interceptor.js The JS file used for the document.cookie= interception; you may ignore it.

config/deprecated.txt Not actively used now; you may ignore it.
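
The document.cookie= hijacking mentioned above is typically done by redefining the cookie accessor and logging every write before delegating to the original setter. A minimal sketch of that idea follows; it is not the actual contents of document-cookie-interceptor.js.

// minimal sketch of document.cookie interception, not the real interceptor file
const original = Object.getOwnPropertyDescriptor(Document.prototype, 'cookie');
if (original && original.get && original.set) {
  Object.defineProperty(document, 'cookie', {
    get() {
      return original.get.call(document);
    },
    set(value) {
      // log the write so the crawler can pick it up from the page console
      console.log('[cookie-interceptor] document.cookie=', value);
      original.set.call(document, value);
    },
    configurable: true,
  });
}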

Output folder

output/cookie-info.json This is the source of truth for all scanning results.

output/cookie-names.txt All cookie+domain pairs found; this is a derivative of cookie-info.json for easier reading.

output/crawled-urls.txt All urls that have been crawled.

How to setup

Install nodejs v22

  • Install the Node version manager nvm
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.1/install.sh | bash
  • Put this into your ~/.zshrc to make the nvm command accessible in every shell
export NVM_DIR="$([ -z "${XDG_CONFIG_HOME-}" ] && printf %s "${HOME}/.nvm" || printf %s "${XDG_CONFIG_HOME}/nvm")"
[ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh" # This loads nvm
  • Restart your terminal and install node 22 and yarn
nvm install v22
npm install -g yarn

Install dependencies

Note that this step needs to be done every time you do git pull

yarn 
# compile TypeScript to JavaScript; the final product is in the dist/ folder
yarn compile 

Functionalities

Check how many cookies from the master sheet have been found, and how many distinct cookies have been found in total

yarn detect

Get Cookie Info

The current repo already contains the results of a previous run (which might not be complete), and you may run COOKIE=$SOME_COOKIE_NAME yarn get-cookie to get a cookie's information. Caveat: always write the output to an external file for inspection, as VS Code truncates long console output.

# by default, output at most 100 traces per domain
npx cross-env COOKIE=$SOME_COOKIE_NAME yarn get-cookie > a.txt

# but you can customize the maximum
npx cross-env MAX_LINE_COUNT=150 COOKIE=$SOME_COOKIE_NAME yarn get-cookie > a.txt

# you can also output everything and skip copying to the pasteboard
npx cross-env ALL=true COOKIE=$SOME_COOKIE_NAME yarn get-cookie > a.txt

Run/Debug cookie scanning

This is for normal scanning, which initializes its starting pages from config.entryList

# search for all set-cookie instances
# TASKS defaults to 300 if not defined
npx cross-env TASKS=100 yarn start

This crawls only the urls in config.entryList and then exits

npx cross-env STOP_IMMEDIATE_PROPAGATION=true DOMAIN=https://www.example.com yarn start

This crawls only the urls in config.entryList plus the iframes and a links on those pages, and then exits

npx cross-env STOP_PROPAGATION=true DOMAIN=google yarn start

This is for debugging

# make sure POOL=1, otherwise you will get entangled results
npx cross-env POOL=1 yarn debug

# it is helpful to turn on verbose mode + devtools in debug mode
npx cross-env POOL=1 VERBOSE=true DEV_TOOL=true yarn debug

Then open chrome://inspect in Chrome and you will see the debug target listed (see the screenshots in the repository). In case you don't see it, click "Discover network targets" and update the target configuration as shown there.

Linting

Prettier and the Husky pre-commit hook are not implemented for the time being.

yarn lint
yarn lint --fix

Caveats

  • TASKS count does not include the tasks in the config.entryList
  • For Mac users: the code tries to close each tab after each task and destroys all browser instances after the run finishes or is interrupted, but in case some headless browsers are not released, search for "chrome for testing" in Mac's Activity Monitor and kill the remaining browsers. This problem does not seem to exist on Windows, as tested on another computer
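
If you prefer the command line, something like the following should kill any leftover headless browsers on a Mac, assuming the process name contains "chrome for testing":

pkill -if "chrome for testing"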

Fine tunings

  • Define task count -> the default is 300 tasks, but you may change it with TASKS=1000

  • The default pool size is 10, but you may set it with POOL=20. Normally one browser takes up about 1.2G of memory, so if your machine can afford at least 20 * 1.2 * 1.5 (buffer for the running system) = 36G of memory, 20 instances should not be a problem

  • Verbose output -> set env variable: VERBOSE=true

  • Change headless browser count -> set env variable: POOL=$some_number

  • Show devtool -> set env variables: DEV_TOOL=true POOL=1
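
For example, the variables above can be combined in a single run:

npx cross-env DOMAIN=google TASKS=1000 POOL=20 VERBOSE=true yarn start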

Future Opportunities

  1. Some domains require interacting with a blocking page before the crawler can start crawling, for example a bot detection page, or the PIPL consent wall you get when visiting www.booking.com from China. In the future, the tool should allow users to write custom logic to manipulate puppeteer so that these blockers can be worked around via a few mimicked user interactions (for example, clicking a checkbox and then clicking OK)

  2. At the moment only Cookie detection is supported, but finding other tracking technologies such as localStorage/sessionStorage/iframes should also be easy

  3. User journey: some cookies are only dropped under specific conditions, which most of the time requires the user to be logged in or to go through some user journey. Still, on each page, the capture of Cookies works the same way, via intercepting "document.cookie=" and observing "set-cookie" headers.

  4. A more friendly UI tool is being planned
