- # PyWebCopy © 5
+ # PyWebCopy © 6

`Created By: Raja Tomar`
`License: MIT`
`Email: rajatomar788@gmail.com`

- Web scraping and saving complete webpages and websites with Python.
+ Cloning websites and webpages with Python, at ease.
+ Web scraping or saving complete webpages and websites with Python.

Web scraping and archiving tool written in Python.
Archive any online website and its assets, css, js and
@@ -14,9 +15,13 @@ It's easy with `pywebcopy`.
Why it's great? Because it:

- respects `robots.txt`
- - has a single-function basic usage
+ - saves a webpage with its css, js and images in one call
+ - clones a complete website, with assets and links remapped, in one call
+ - has direct APIs for simplicity and ease
+ - supports subclassing for advanced usage
+ - supports custom html tag handlers
- offers lots of configuration options for many custom needs
- - bundles several scraping packages in one Object (thanks to their original owners)
+ - bundles several scraping packages in one object (thanks to their original owners)
  - beautifulsoup4
  - lxml
  - requests
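The `robots.txt` behavior mentioned in the list can be previewed without pywebcopy at all. This is a stdlib-only sketch of the kind of check such a tool performs, not pywebcopy's internal code; the robots.txt content here is illustrative.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, parsed offline (no network needed)
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Paths outside the disallowed prefix are fetchable, others are not
print(rp.can_fetch("*", "http://example.com/index.html"))      # True
print(rp.can_fetch("*", "http://example.com/private/a.html"))  # False
```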
@@ -37,12 +42,12 @@ You are ready to go. Read the tutorials below to get started.
## First steps

- You should always check whether pywebcopy is installed successfully.
+ You should always check whether the latest pywebcopy is installed successfully.

```python
>>> import pywebcopy
>>> pywebcopy.__version__
- 5.x
+ 6.x
```

Your version may differ; now you can continue with the tutorial.
@@ -54,10 +59,12 @@ To save any single page, just type in python console
```python
from pywebcopy import save_webpage

+ kwargs = {'project_name': 'some-fancy-name'}

save_webpage(
    url='http://example-site.com/index.html',
-   project_folder='path/to/downloads'
+   project_folder='path/to/downloads',
+   **kwargs
)
```
@@ -66,15 +73,18 @@ To save a full website (this could overload the target server, so be careful)

```python
from pywebcopy import save_website

+ kwargs = {'project_name': 'some-fancy-name'}
+
save_website(
    url='http://example-site.com/index.html',
    project_folder='path/to/downloads',
+   **kwargs
)
```

### 1.2.1 Running Tests

Running tests is simple and doesn't require any external library.
- Just run this command from the root directory of the pywebcopy package
+ Just run this command from the root directory of the pywebcopy package.
```shell
@@ -89,24 +99,24 @@ from pywebcopy import WebPage
url = 'http://example-site.com/index.html' or None
project_loc = 'path/to/downloads/folder'

- wp = WebPage(url,
-             project_folder,
-             default_encoding=None,
-             HTML=None,
-             **configKwargs)
+ wp = WebPage()

# You can choose to load the page explicitly using the
# `requests` module
wp.get(url, **requestsKwargs)

+ # OR
+ # You can choose to set the source yourself
+ handle = open('file.html', 'rb')
+ wp.set_source(handle)
+
# if you want assets only
wp.save_assets()

# if you want html only
wp.save_html()

- # if you want the complete webpage
+ # if you want the complete webpage with css, js and images
wp.save_complete()
```
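`wp.set_source()` in the block above takes an open file handle, and later sections note that any object with a `read` method works. That contract can be checked with the standard library alone; this sketch does not import pywebcopy, and the HTML string is illustrative.

```python
import io

html = b"<html><body><p>hello</p></body></html>"
handle = io.BytesIO(html)  # file-like; could stand in for open('file.html', 'rb')

# the minimal contract set_source() relies on: a read method
assert hasattr(handle, "read")
print(handle.read(6))  # b'<html>'
```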
@@ -171,6 +181,7 @@ then check if website allows scraping of its content.
>>> pywebcopy.config['bypass_robots'] = True

# rest of your code follows..
+
```

### Overwrite existing files when copying
@@ -183,6 +194,7 @@ use the over_write config key.
>>> pywebcopy.config['over_write'] = True

# rest of your code follows..
+
```

### Changing your project name
@@ -196,6 +208,7 @@ below
>>> pywebcopy.config['project_name'] = 'my_project'

# rest of your code follows..
+
```

## How to - Save Single Webpage
@@ -204,28 +217,42 @@ Particular webpage can be saved easily using the following methods.
Note: if you get `pywebcopy.exceptions.AccessError` when running any of this code, use the code provided in the later sections.

- ### Method 1
+ ### Method 1: via api - `save_webpage()`

A webpage can easily be saved using an inbuilt function called `.save_webpage()`, which also takes several arguments.

```python
- >>> import pywebcopy
- >>> pywebcopy.save_webpage(project_url='http://google.com', project_folder='c://Saved_Webpages/')
+ >>> from pywebcopy import save_webpage
+ >>> save_webpage(project_url='http://google.com', project_folder='c://Saved_Webpages/')

- # rest of your code follows..
```
### Method 2

- This use case is slightly more powerful, as it can provide every functionality of the WebPage
- data class.
+ This use case is slightly more powerful, as it can provide every functionality of the WebPage class.

```python
- >>> from pywebcopy import Webpage
+ >>> from pywebcopy import Webpage, config
+ >>> url = 'http://some-url.com/some-page.html'
+
+ # You should always start with setting up the config, or use the apis
+ >>> config.setup_config(url, project_folder, project_name, **kwargs)
+
+ # Create an instance of the webpage object
+ >>> wp = Webpage()
+
+ # If you want to use `requests` to fetch the page
+ >>> wp.get(url)
+
+ # Else, if you want to use plain html or urllib
+ >>> wp.set_source(object_which_have_a_read_method, encoding=encoding)
+ >>> wp.url = url  # you need to do this if you are using set_source()

- >>> wp = WebPage('http://google.com', 'e://tests/', project_name='Google')
+ # Then you can access several methods, like
>>> wp.save_complete()
+ >>> wp.save_html()
+ >>> wp.save_assets()

# This Webpage object contains every method of the Webpage() class and thus
# can be reused later.
```
@@ -242,44 +269,50 @@ One feature is that the raw html is now also accepted.
```python
- >>> from pywebcopy import Webpage
+ >>> from pywebcopy import Webpage, config

>>> HTML = open('test.html').read()

>>> base_url = 'http://example.com'  # used as a base for downloading imgs, css, js files
>>> project_folder = '/saved_pages/'
+ >>> config.setup_config(base_url, project_folder)

- >>> wp = WebPage(base_url, project_folder, HTML=HTML)
+ >>> wp = WebPage()
+ >>> wp.set_source(HTML)
+ >>> wp.url = base_url

>>> wp.save_webpage()
+
```

- ## How to - Whole Websites
+ ## How to - Clone Whole Websites

Use caution when copying websites, as this can overload or damage the
site's servers and could, in rare cases, be illegal; so check everything before
you proceed.
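Independent of pywebcopy, one way to honor this caution is to throttle your own requests. A minimal sketch (the class name and delay value are illustrative, not part of pywebcopy):

```python
import time

class Throttle:
    """Enforce a minimum delay between successive requests."""

    def __init__(self, delay=1.0):
        self.delay = delay
        self._last = 0.0

    def wait(self):
        # sleep just long enough to keep `delay` seconds between calls
        gap = time.monotonic() - self._last
        if gap < self.delay:
            time.sleep(self.delay - gap)
        self._last = time.monotonic()

throttle = Throttle(delay=0.01)
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # call this before each page fetch
print(time.monotonic() - start >= 0.02)  # True: two enforced gaps after the first call
```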
- ### Method 1 -
+ ### Method 1: via api - `save_website()`

Use the inbuilt api `.save_website()`, which takes several arguments.

```python
- >>> import pywebcopy
+ >>> from pywebcopy import save_website
+
+ >>> save_website(project_url='http://localhost:8000', project_folder='e://tests/')

- >>> pywebcopy.save_website(project_url='http://localhost:8000', project_folder='e://tests/')
```
303
272
304
# ## Method 2 -
273
305
274
306
By creating a Crawler() object which provides several other functions as well.
275
307
276
308
```python
277
- >> > import pywebcopy
309
+ >> > from pywebcopy import Crawler, config
278
310
279
- >> > pywebcopy. config.setup_config(project_url = ' http://localhost:5000/' , project_folder = ' e://tests/' , project_name = ' LocalHost' )
311
+ >> > config.setup_config(project_url = ' http://localhost:5000/' , project_folder = ' e://tests/' , project_name = ' LocalHost' )
280
312
281
- >> > crawler = pywebcopy. Crawler(' http://localhost:5000/' )
313
+ >> > crawler = Crawler(' http://localhost:5000/' )
282
314
>> > crawler.crawl()
315
+
283
316
```
284
317
285
318
# # Contribution
@@ -296,33 +329,36 @@ If you have any suggestions or fixes or reports feel free to mail me :)
`pywebcopy` is highly configurable.

- ### 1.3.1 Direct Call Method
+ ### 1.3.1 APIs

- To change any configuration, just pass it to the `init` call.
+ To change any configuration, just pass it to the `api` call.

Example:

```python
- from pywebcopy.core import save_webpage
+ from pywebcopy import save_webpage
+
+ kwargs = {
+     'key1': 'value1',
+     ...
+ }

save_webpage(
    url='http://some-site.com/',        # required
    download_loc='path/to/downloads/',  # required

-   # config keys are case-insensitive
-   any_config_key='new_value',
-   another_config_key='another_new_value',
+   **kwargs

    ...
    # add as many as you want :)
)
+
```
### 1.3.2 `config.setup_config` Method

- > **This function has moved from `core.setup_config`**

You can manually set every configuration option by using a
`config.setup_config` call.
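A minimal sketch of such a call, reusing only arguments and config keys that appear elsewhere in this README; treat the exact keyword set as an assumption that may vary between versions.

```python
from pywebcopy import config

config.setup_config(
    project_url='http://localhost:5000/',
    project_folder='e://tests/',
    project_name='LocalHost',
    bypass_robots=True,  # extra kwargs update config keys (assumed from the **kwargs form above)
)
```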
@@ -378,12 +414,6 @@ below is the list of `config` keys with their `default` values :
# delete the project folder after making a zip archive of it
'delete_project_folder': False

- # which parser to use when parsing pages
- # for speed choose 'html.parser' (will break some webpages)
- # for an exact webpage copy choose 'html5lib' (a little slow)
- # or leave it at the default 'lxml' (balanced)
- 'PARSER': 'lxml'
-
# whether to download css files or not
'LOAD_CSS': True
@@ -398,10 +428,7 @@ below is the list of `config` keys with their `default` values :
'OVER_WRITE': False

# list of allowed file extensions
- 'ALLOWED_FILE_EXT': ['.html', '.css', '.json', '.js',
-                      '.xml', '.svg', '.gif', '.ico',
-                      '.jpeg', '.jpg', '.png', '.ttf',
-                      '.eot', '.otf', '.woff']
+ 'ALLOWED_FILE_EXT': ['.html', '.css', ...]

# log file path
'LOG_FILE': None
@@ -425,6 +452,7 @@ below is the list of `config` keys with their `default` values :
# bypass the robots.txt restrictions
'BYPASS_ROBOTS': False
+
```

Told you there were plenty of `config` vars available!
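The keys appear in UPPERCASE in this list but lowercase in earlier snippets (`'OVER_WRITE'` vs `'over_write'`), which implies case-insensitive lookup. A toy model of that behavior, not pywebcopy's actual config class:

```python
class CaseInsensitiveConfig(dict):
    """Toy dict whose string keys ignore case."""

    def __setitem__(self, key, value):
        super().__setitem__(key.lower(), value)

    def __getitem__(self, key):
        return super().__getitem__(key.lower())

config = CaseInsensitiveConfig()
config['OVER_WRITE'] = True
print(config['over_write'])  # True
```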