- # PyWebCopy © ;
+ # PyWebCopy © ; 2.0(beta)

`Created By : Raja Tomar`
`License : MIT`

+ Mirrors complete webpages with Python.
+
Website mirroring and archiving tool written in Python.
Archive any online website and its assets (CSS, JS and
images) for offline reading, storage or whatever reason.
@@ -17,56 +19,89 @@ Why it's great? because it -

Email me at `rajatomar788@gmail.com` for any query :)

- ## Installation
+ ## 1.1 Installation

`pywebcopy` is available on PyPI and is easily installable using `pip`

```Python
pip install pywebcopy
```

- ## Basic Usages
+ ## 1.2 Basic Usages

+ ### 1.2.1 Direct Function Methods
To mirror any single page, just type in the Python console:

```Python
- from pywebcopy.core import init
+ from pywebcopy.core import save_webpage
+

- init(url='http://example-site.com/index.html')
+ save_webpage(
+     url='http://example-site.com/index.html',
+     download_loc='path/to/downloads'
+ )
```

To mirror a full website (this could overload the target server, so be careful):

```Python
- from pywebcopy.core import init
+ from pywebcopy.core import save_webpage

- init(
+
+ save_webpage(
      url='http://example-site.com/index.html',
-     copy_all=True
- )
+     download_loc='path/to/downloads',
+     copy_all=True
+ )
+ ```
+
+ ### 1.2.2 Object Creation Method
+
+ ```Python
+ from pywebcopy.structures import WebPage
+
+ url = 'http://example-site.com/index.html'
+ download_loc = 'path/to/downloads/folder'
+
+ wp = WebPage(url, download_loc)
+
+ # if you want assets only
+ wp.save_assets_only()
+
+ # if you want html only
+ wp.save_html_only()
+
+ # if you want complete webpage
+ wp.save_complete()
+
+ # bonus: you can also use any BeautifulSoup methods on it
+ links = wp.find_all('a', href=True)
+
```
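
To give a feel for what the `find_all('a', href=True)` call above collects, here is a rough standard-library sketch. It does not use `pywebcopy` or BeautifulSoup at all; the `LinkCollector` class and the sample HTML are made up for illustration, and the real `WebPage` object may behave differently.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.links.append(href)


# made-up page content: the middle anchor has no href and is skipped
html = ('<a href="/about.html">About</a>'
        '<a name="top">Top</a>'
        '<a href="style.css">CSS</a>')

collector = LinkCollector()
collector.feed(html)
print(collector.links)
```

Only the anchors that carry an `href` end up in the list, which is roughly the filtering the `href=True` argument asks for.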

That's it.

- You will now have a folder in C: drive
- `C:\WebCopyProjects\example-site.com\example-site.com\`
+ You will now have a folder at `download_loc` containing the webpage and all its linked files, ready to be used.

Just browse it as you would any website in your browser!

- ## Configuration
+ ## 1.3 Configuration

`pywebcopy` is highly configurable.

+ ### 1.3.1 Direct Call Method
+
To change any configuration, just pass it to the `save_webpage` call (which replaces the old `init` call).

Example:

```Python
- from pywebcopy.core import init
+ from pywebcopy.core import save_webpage

- init(
+ save_webpage(

      url='http://some-site.com/',  # required
+     download_loc='path/to/downloads/',  # required

      # config keys are case-insensitive
      any_config_key='new_value',
@@ -78,14 +113,54 @@ init(
)
```
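
If you are wondering what "config keys are case-insensitive" can look like in practice, below is a minimal sketch. It is not the code `pywebcopy` uses internally; the `CaseInsensitiveConfig` class is a hypothetical stand-in that simply lower-cases every key on the way in and out.

```python
class CaseInsensitiveConfig(dict):
    """Hypothetical config store that normalises every key to lower case."""

    def __init__(self, **defaults):
        super().__init__()
        self.update(**defaults)

    def __setitem__(self, key, value):
        super().__setitem__(key.lower(), value)

    def __getitem__(self, key):
        return super().__getitem__(key.lower())

    def update(self, **options):
        # route every keyword through __setitem__ so keys get normalised
        for key, value in options.items():
            self[key] = value


config = CaseInsensitiveConfig(DEBUG=False, BYPASS_ROBOTS=False)
config.update(debug=True)      # passed in lower case ...
print(config['DEBUG'])         # ... read back in upper case
```

With a store like this, `DEBUG=True` and `debug=True` end up changing the same entry, so the keyword spelling in the call does not matter.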

+ ### 1.3.2 `core.setup_config` Method
+
+ You can manually set up the whole configuration with a
+ `core.setup_config` call.
+
+ ```Python
+
+ import pywebcopy
+
+ url = 'http://example-site.com/index.html'
+ download_loc = 'path/to/downloads/'
+
+ pywebcopy.core.setup_config(url, download_loc)
+
+ # done!
+
+ >>> pywebcopy.config.config['url']
+ 'http://example-site.com/index.html'
+
+ >>> pywebcopy.config.config['mirrors_dir']
+ 'path/to/downloads'
+
+ >>> pywebcopy.config.config['project_name']
+ 'example-site.com'
+
+
+ ## You can also change any of these by just adding a param to
+ ## the `setup_config` call
+
+ >>> pywebcopy.core.setup_config(url,
+         download_loc, project_name='Your-Project', ...)
+
+ ## You can also change any config even after
+ ## the `setup_config` call
+
+ pywebcopy.config.config['url'] = 'http://url-changed.com'
+ # rest of config remains unchanged
+
+ ```
+
Done!
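
The `setup_config` example shows `project_name` coming out as `'example-site.com'` and `mirrors_dir` as `'path/to/downloads'`. One plausible way to arrive at such values is sketched below, assuming the project name is simply the host part of the URL and the trailing slash is trimmed from the download location. `guess_project_name` and `build_config` are hypothetical names for illustration, not functions of `pywebcopy`.

```python
from urllib.parse import urlparse


def guess_project_name(url):
    """Return the host part of the url, e.g. 'example-site.com'."""
    return urlparse(url).hostname or ''


def build_config(url, download_loc, **overrides):
    """Hypothetical helper: fill in a few config keys from two inputs."""
    config = {
        'url': url,
        'mirrors_dir': download_loc.rstrip('/'),   # assumption: slash trimmed
        'project_name': guess_project_name(url),
    }
    # explicit keyword arguments win over the guessed values
    config.update({key.lower(): value for key, value in overrides.items()})
    return config


config = build_config('http://example-site.com/index.html',
                      'path/to/downloads/')
print(config['project_name'])
print(config['mirrors_dir'])

custom = build_config('http://example-site.com/index.html',
                      'path/to/downloads/',
                      project_name='Your-Project')
print(custom['project_name'])
```

The last call mirrors the idea of passing `project_name` along to override the guessed value.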

- ### List of available `configurations`
+ ### 1.3.3 List of available `configurations`

Below is the list of `config` keys with their `default` values:

```Python
- # writes the log file content to console directly
+ # writes the trace output and log file content to console directly
'DEBUG': False

# make zip archive of the downloaded content
@@ -118,9 +193,6 @@ below is the list of `config` keys with their `default` values :
    '.jpeg', '.jpg', '.png', '.ttf',
    '.eot', '.otf', '.woff']

- # file to write all valid links found on pages
- 'LINK_INDEX_FILE': None
-
# log file path
'LOG_FILE': None

@@ -142,7 +214,7 @@ below is the list of `config` keys with their `default` values :
'URL': None

# define the base directory to store all copied sites data
- 'MIRRORS_DIR': C:/WebCopyProjects/ + Project_Name
+ 'MIRRORS_DIR': None

# all downloaded file location
# available after any project completion
@@ -159,16 +231,15 @@ below is the list of `config` keys with their `default` values :
'FILENAME_VALIDATION_PATTERN': re.compile(r'[*":<>\|\?]+')

# user agent to be shown on requests made to server
- 'USER_AGENT': Mozilla/4.0 (compatible; WebCopyBot/X.X;
-     +Non-Harmful-LightWeight)
+ 'USER_AGENT': Mozilla/5.0 (compatible; WebCopyBot/X.X;)

# bypass the robots.txt restrictions
'BYPASS_ROBOTS': False
```
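
As a small illustration of what a key like `FILENAME_VALIDATION_PATTERN` is good for, the sketch below uses the default pattern from the list (written here without whitespace inside the character class) to scrub characters that are not safe in file names. How `pywebcopy` applies the pattern internally is not shown in this document, so the `clean_filename` helper and the `_` replacement are assumptions.

```python
import re

# default pattern as listed in the config keys
FILENAME_VALIDATION_PATTERN = re.compile(r'[*":<>\|\?]+')


def clean_filename(name, replacement='_'):
    """Hypothetical helper: swap each run of unsafe characters for one '_'."""
    return FILENAME_VALIDATION_PATTERN.sub(replacement, name)


print(clean_filename('index.php?page=1'))
print(clean_filename('a<b>:c*.css'))
```

Because the pattern ends in `+`, a run of several unsafe characters in a row collapses into a single replacement character.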

Told you there were plenty of `config` vars available!
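
Two of the keys above, `USER_AGENT` and `BYPASS_ROBOTS`, are about being polite to the target server. Here is a hedged sketch of what a robots.txt check could look like with the standard library's `urllib.robotparser`. The `is_allowed` helper, the shortened user agent name and the sample robots.txt are assumptions for illustration; this is not how `pywebcopy` is known to implement it.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = 'WebCopyBot'      # assumption: shortened agent name
BYPASS_ROBOTS = False

# made-up robots.txt content so the example runs without network access
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(url, robots_txt=ROBOTS_TXT, bypass=BYPASS_ROBOTS):
    """Return True when the url may be fetched under the given rules."""
    if bypass:
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(USER_AGENT, url)


print(is_allowed('http://example-site.com/index.html'))
print(is_allowed('http://example-site.com/private/data.html'))
print(is_allowed('http://example-site.com/private/data.html', bypass=True))
```

A check like this before every request, plus a short pause between requests, keeps a full-site copy from hammering the server.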

- ## Help
+ ## 1.4 Help

For any queries related to this project you can email me at
`rajatomar788@gmail.com`
@@ -181,7 +252,7 @@ You can help in many ways:

Thanks!

- ## Undocumented Features
+ ## 1.5 Undocumented Features

I built many utils and classes in this project to ease
the tasks I was trying to do.
@@ -192,14 +263,20 @@ these task are also suitable for general purpose use.
So,
if you want to help write suitable `documentation` for these undocumented ones, you can always email me.

- ## Changelog
+ ## 1.6 Changelog

- ### [version 1.9]
+ ### [version 2.0(beta)]
+
+ - `init` function is replaced with `save_webpage`
+ - three new `config` automation functions are added -
+     - `core.setup_config` (creates every ideal config just from url and download location)
+     - `config.reset_config` (resets the configuration to default state)
+     - `config.update_config` (manual-mode version of `core.setup_config`)
+ - object `structures.WebPage` added
+ - merged `generators.generate_style_map` and `generators.generate_relative_paths` to a single function `generators.generate_style_map`
+ - rewrite of majority of functions
+ - new module `exceptions` added

- - more redundant code
- - modules are now separated based on type e.g. Core, Generators, Utils etc.
- - new helper functions and class `structures.WebPage`
- - Compatible with Python 2.6, 2.7, 3.6, 3.7

### [version 1.10]

@@ -208,3 +285,10 @@ if you want, you can help in generating suitable `documentation` for these undoc
- `init` call now takes `url` arg by default and could raise an error when not supplied
- professional looking log entries
- rewritten archiving system now uses `zipfile` and `exceptions` handling to prevent errors and eventual archive corruption
+
+ ### [version 1.9]
+
+ - more redundant code
+ - modules are now separated based on type e.g. Core, Generators, Utils etc.
+ - new helper functions and class `structures.WebPage`
+ - Compatible with Python 2.6, 2.7, 3.6, 3.7