
Commit e64e370

committed
v2.0 beta
1 parent cbdfc96 commit e64e370

18 files changed (+1569, -1185 lines changed)

README.md

Lines changed: 114 additions & 30 deletions
Original file line number | Diff line number | Diff line change
@@ -1,8 +1,10 @@
1-
# PyWebCopy ©
1+
# PyWebCopy © 2.0(beta)
22

33
`Created By : Raja Tomar`
44
`License : MIT`
55

6+
Mirrors complete webpages with Python.
7+
68
Website mirroring and archiving tool written in Python
79
Archive any online website and its assets, css, js and
810
images for offline reading, storage or whatever reasons.
@@ -17,56 +19,89 @@ Why it's great? because it -
1719

1820
Email me at `rajatomar788@gmail.com` for any query :)
1921

20-
## Installation
22+
## 1.1 Installation
2123

2224
`pywebcopy` is available on PyPI and is easily installable using `pip`
2325

2426
```Python
2527
pip install pywebcopy
2628
```
2729

28-
## Basic Usages
30+
## 1.2 Basic Usages
2931

32+
### 1.2.1 Direct Function Methods
3033
To mirror any single page, just type in the Python console
3134

3235
```Python
33-
from pywebcopy.core import init
36+
from pywebcopy.core import save_webpage
37+
3438

35-
init(url='http://example-site.com/index.html')
39+
save_webpage(
40+
url='http://example-site.com/index.html',
41+
download_loc='path/to/downloads'
42+
)
3643
```
3744

3845
To mirror a full website (this could overload the target server, so be careful)
3946

4047
```Python
41-
from pywebcopy.core import init
48+
from pywebcopy.core import save_webpage
4249

43-
init(
50+
51+
save_webpage(
4452
url='http://example-site.com/index.html',
45-
copy_all = True
46-
)
53+
download_loc='path/to/downloads',
54+
copy_all=True
55+
)
56+
```
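Because mirroring a full website can load the target server, it may be worth checking the site's `robots.txt` rules before starting. The sketch below uses only the Python 3 standard library (`urllib.robotparser`) and is not how `pywebcopy` itself handles robots rules. The rules text, the `WebCopyBot` agent name, and the `is_allowed` helper are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real run would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(url, user_agent="WebCopyBot", robots_txt=ROBOTS_TXT):
    """Return True if the given robots.txt text permits fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed("http://example-site.com/index.html"))
print(is_allowed("http://example-site.com/private/notes.html"))
```

A crawler could call such a helper for every discovered link and skip the ones that are disallowed.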
57+
58+
### 1.2.2 Object Creation Method
59+
60+
```Python
61+
from pywebcopy.structures import WebPage
62+
63+
url = 'http://example-site.com/index.html'
64+
download_loc = 'path/to/downloads/folder'
65+
66+
wp = WebPage(url, download_loc)
67+
68+
# if you want assets only
69+
wp.save_assets_only()
70+
71+
# if you want html only
72+
wp.save_html_only()
73+
74+
# if you want complete webpage
75+
wp.save_complete()
76+
77+
# bonus : you can also use any beautiful_soup methods on it
78+
links = wp.find_all('a', href=True)
79+
4780
```
4881

4982
that's it.
5083

51-
You will now have a folder in C: drive
52-
`C:\WebCopyProjects\example-site.com\example-site.com\`
84+
You will now have a folder at `download_loc` with the webpage and all its linked files, ready to be used.
5385

5486
Just browse it as you would in any browser!
5587

56-
## Configuration
88+
## 1.3 Configuration
5789

5890
`pywebcopy` is highly configurable.
5991

92+
### 1.3.1 Direct Call Method
93+
6094
To change any configuration, just pass it to the `save_webpage` call.
6195

6296
Example:
6397

6498
```Python
65-
from pywebcopy.core import init
99+
from pywebcopy.core import save_webpage
66100

67-
init(
101+
save_webpage(
68102

69103
url='http://some-site.com/', # required
104+
download_loc='path/to/downloads/', # required
70105

71106
# config keys are case-insensitive
72107
any_config_key='new_value',
@@ -78,14 +113,54 @@ init(
78113
)
79114
```
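The note that config keys are case-insensitive can be illustrated with a small stand-alone mapping. This is only a sketch of the idea, not the code of `structures.CaseInsensitiveDict`. The class name `LowerKeyDict` is hypothetical, and it simply lower-cases string keys on every write and lookup.

```python
class LowerKeyDict(dict):
    """Minimal dict that lower-cases string keys on write and lookup."""

    @staticmethod
    def _norm(key):
        return key.lower() if isinstance(key, str) else key

    def __init__(self, *args, **kwargs):
        super().__init__()
        self.update(*args, **kwargs)

    def __setitem__(self, key, value):
        super().__setitem__(self._norm(key), value)

    def __getitem__(self, key):
        return super().__getitem__(self._norm(key))

    def __contains__(self, key):
        return super().__contains__(self._norm(key))

    def update(self, *args, **kwargs):
        # Route every item through __setitem__ so keys are normalised.
        for key, value in dict(*args, **kwargs).items():
            self[key] = value


config = LowerKeyDict({'DEBUG': False, 'OVER_WRITE': False})
config.update(over_write=True)
print(config['OVER_WRITE'], config['debug'])
```

Methods such as `get` and `pop` are not overridden in this sketch, so a fuller version would need to normalise keys there as well.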
80115

116+
### 1.3.2 `core.setup_config` Method
117+
118+
You can manually set every configuration value by using a
119+
`core.setup_config` call.
120+
121+
```Python
122+
123+
import pywebcopy
124+
125+
url = 'http://example-site.com/index.html'
126+
download_loc = 'path/to/downloads/'
127+
128+
pywebcopy.core.setup_config(url, download_loc)
129+
130+
# done!
131+
132+
>>> pywebcopy.config.config['url']
133+
'http://example-site.com/index.html'
134+
135+
>>> pywebcopy.config.config['mirrors_dir']
136+
'path/to/downloads'
137+
138+
>>> pywebcopy.config.config['project_name']
139+
'example-site.com'
140+
141+
142+
## You can also change any of these by just adding param to
143+
## `setup_config` call
144+
145+
>>> pywebcopy.core.setup_config(url,
146+
download_loc,project_name='Your-Project', ...)
147+
148+
## You can also change any config even after
149+
## the `setup_config` call
150+
151+
pywebcopy.config.config['url'] = 'http://url-changed.com'
152+
# rest of config remains unchanged
153+
154+
```
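The values shown in the session above (`project_name` equal to the host name, `mirrors_dir` without the trailing slash) could be derived roughly as follows. This is a guess at the behaviour, not the actual body of `core.setup_config`, and the helper name `derive_defaults` is hypothetical.

```python
from urllib.parse import urlparse


def derive_defaults(url, download_loc):
    """Hypothetical helper: build a small config dict from two inputs."""
    return {
        'url': url,
        # Drop a trailing slash so paths can be joined consistently.
        'mirrors_dir': download_loc.rstrip('/'),
        # Use the host part of the url as the project name.
        'project_name': urlparse(url).netloc,
    }


defaults = derive_defaults('http://example-site.com/index.html',
                           'path/to/downloads/')
print(defaults['project_name'])
print(defaults['mirrors_dir'])
```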
155+
81156
Done!
82157

83-
### List of available `configurations`
158+
### 1.3.3 List of available `configurations`
84159

85160
Below is the list of `config` keys with their `default` values:
86161

87162
``` Python
88-
# writes the log file content to console directly
163+
# writes the trace output and log file content to console directly
89164
'DEBUG': False
90165

91166
# make zip archive of the downloaded content
@@ -118,9 +193,6 @@ below is the list of `config` keys with their `default` values :
118193
'.jpeg', '.jpg', '.png', '.ttf',
119194
'.eot', '.otf', '.woff']
120195

121-
# file to write all valid links found on pages
122-
'LINK_INDEX_FILE': None
123-
124196
# log file path
125197
'LOG_FILE': None
126198

@@ -142,7 +214,7 @@ below is the list of `config` keys with their `default` values :
142214
'URL': None
143215

144216
# define the base directory to store all copied sites data
145-
'MIRRORS_DIR': C:/WebCopyProjects/ + Project_Name
217+
'MIRRORS_DIR': None
146218

147219
# all downloaded file location
148220
# available after any project completion
@@ -159,16 +231,15 @@ below is the list of `config` keys with their `default` values :
159231
'FILENAME_VALIDATION_PATTERN': re.compile(r'[*":<>\|\?]+')
160232

161233
# user agent to be shown on requests made to server
162-
'USER_AGENT' : Mozilla/4.0 (compatible; WebCopyBot/X.X;
163-
+Non-Harmful-LightWeight)
234+
'USER_AGENT' : Mozilla/5.0 (compatible; WebCopyBot/X.X;)
164235

165236
# bypass the robots.txt restrictions
166237
'BYPASS_ROBOTS' : False
167238
```
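As an illustration of how a pattern like `FILENAME_VALIDATION_PATTERN` can be applied, the sketch below replaces every run of matching characters in a file name. The regular expression is copied from the list above. The `clean_filename` helper and the `_` replacement character are assumptions, and `pywebcopy` may use the pattern differently.

```python
import re

# Pattern copied from the config list above.
FILENAME_VALIDATION_PATTERN = re.compile(r'[*":<>\|\?]+')


def clean_filename(name, replacement='_'):
    """Replace each run of invalid characters with `replacement`."""
    return FILENAME_VALIDATION_PATTERN.sub(replacement, name)


print(clean_filename('index?page=1.html'))
print(clean_filename('a<b>:c*.css'))
```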
168239

169240
told you there were plenty of `config` vars available!
170241

171-
## Help
242+
## 1.4 Help
172243

173244
For any queries related to this project you can email me at
174245
`rajatomar788@gmail.com`
@@ -181,7 +252,7 @@ You can help in many ways:
181252

182253
Thanks!
183254

184-
## Undocumented Features
255+
## 1.5 Undocumented Features
185256

186257
I built many utils and classes in this project to ease
187258
the tasks I was trying to do.
@@ -192,14 +263,20 @@ these task are also suitable for general purpose use.
192263
So,
193264
if you want to help write suitable `documentation` for these undocumented ones, you can always email me.
194265

195-
## Changelog
266+
## 1.6 Changelog
196267

197-
### [version 1.9]
268+
### [version 2.0(beta)]
269+
270+
- `init` function is replaced with `save_webpage`
271+
- three new `config` automation functions are added -
272+
- `core.setup_config` (creates every ideal config just from url and download location)
273+
- `config.reset_config` (resets the configuration to default state)
274+
- `config.update_config` (manual-mode version of `core.setup_config`)
275+
- object `structures.WebPage` added
276+
- merged `generators.generate_style_map` and `generators.generate_relative_paths` to a single function `generators.generate_style_map`
277+
- rewrite of majority of functions
278+
- new module `exceptions` added
198279

199-
- more redundant code
200-
- modules are now separated based on type e.g. Core, Generators, Utils etc.
201-
- new helper functions and class `structures.WebPage`
202-
- Compatible with Python 2.6, 2.7, 3.6, 3.7
203280

204281
### [version 1.10]
205282

@@ -208,3 +285,10 @@ if you want, you can help in generating suitable `documentation` for these undoc
208285
- `init` call now takes `url` arg by default and could raise a error when not supplied
209286
- professional looking log entries
210287
- rewritten archiving system now uses `zipfile` and `exceptions` handling to prevent errors and eventual archive corruption
288+
289+
### [version 1.9]
290+
291+
- more redundant code
292+
- modules are now separated based on type e.g. Core, Generators, Utils etc.
293+
- new helper functions and class `structures.WebPage`
294+
- Compatible with Python 2.6, 2.7, 3.6, 3.7

build/lib/pywebcopy/__init__.py

Lines changed: 4 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -15,7 +15,8 @@
1515
import config
1616
import utils
1717
import generators
18-
18+
import exceptions
19+
import test_generators
1920

2021
__version__ = config.config['version']
2122
__author__ = 'Raja Tomar'
@@ -26,9 +27,7 @@
2627

2728
__all__ = [
2829
'__version__', '__author__', '__copyright__', '__license__', '__email__',
29-
'core', 'structures', 'config', 'utils', 'generators'
30+
'core', 'structures', 'config', 'utils', 'generators', 'exceptions'
3031
]
3132

32-
if __name__ == "__main__":
33-
import os
34-
core.init("https://google.com", mirrors_dir=os.path.join("E:\Programming\Projects\WebsiteCopier\\", "Mirrors_dir"), bypass_robots=True, over_write=True)
33+

build/lib/pywebcopy/config.py

Lines changed: 27 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -12,16 +12,17 @@
1212

1313

1414
__all__ = [
15-
'config'
15+
'config', 'update_config', 'reset_config'
1616
]
1717

1818
# --------------------------------------------------
19-
# DO NOT MODIFY, you can change these through init()
19+
# DO NOT MODIFY, you can change these through update_config()
2020
# --------------------------------------------------
2121
config = structures.CaseInsensitiveDict({
22-
23-
'VERSION': '1.9.2',
24-
22+
# version no. of this build
23+
'VERSION': '2.0beta',
24+
# not so helpful debug switch, it just dumps the log
25+
# on the console
2526
'DEBUG': False,
2627
# make zip archive of the downloaded content
2728
'MAKE_ARCHIVE': True,
@@ -38,16 +39,15 @@
3839
# to overwrite the existing files if found
3940
'OVER_WRITE': False,
4041
# allowed file extensions
41-
'ALLOWED_FILE_EXT': ['.html', '.php', '.asp', '.htm', '.xhtml', '.css', '.json', '.js', '.xml', '.svg', '.gif', '.ico', '.jpeg',
42+
'ALLOWED_FILE_EXT': ['.html', '.php', '.asp', '.htm', '.xhtml', '.css',
43+
'.json', '.js', '.xml', '.svg', '.gif', '.ico', '.jpeg',
4244
'.jpg', '.png', '.ttf', '.eot', '.otf', '.woff'],
43-
# file to write all valid links found on pages
44-
'LINK_INDEX_FILE': None,
4545
# log file path
4646
'LOG_FILE': None,
4747
# reduce log produced by removing unnecessary info from log file
4848
'LOG_FILE_COMPRESSION': False,
4949
# log buffering store log in ram until finished, then write to file
50-
'LOG_BUFFERING': True,
50+
'LOG_BUFFERING': False,
5151
# log buffer holder for performance speed up
5252
'LOG_BUFFER_ARRAY': list(),
5353
# this pattern is used to validate file names
@@ -64,8 +64,24 @@
6464
'DOWNLOAD_SIZE': 0
6565
})
6666

67-
# user agent to be shown on requests made to server
68-
config['USER_AGENT'] = 'Mozilla/4.0 (compatible; WebCopyBot/{}; +Non-Harmful-LightWeight)'.format(config['version'])
6967

7068
# HANDLE WITH CARE
69+
config['USER_AGENT'] = 'Mozilla/5.0 (compatible; PywebcopyBot/{})'.format(config['version'])
70+
config['ROBOTS'] = structures.RobotsTxt()
7171
config['BYPASS_ROBOTS'] = False
72+
73+
74+
""" This is used to store the default config as a backup """
75+
default_config = config
76+
77+
78+
def update_config(**kwargs):
79+
""" Updates the default `config` dict """
80+
config.update(**kwargs)
81+
82+
83+
def reset_config():
84+
""" Resets all configuration to the default state. """
85+
global config
86+
config = default_config
87+
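Review note on the backup above: `default_config = config` binds a second name to the same dictionary, so changes made through `update_config` are also visible through `default_config`, and rebinding `config` to it restores nothing. The sketch below shows the difference between an alias and a copied snapshot using a plain `dict`. The two keys are taken from the config above, while the names `alias_backup` and `snapshot_backup` and the in-place reset are assumptions about one possible fix, not the project's code.

```python
import copy

config = {'DEBUG': False, 'LOG_FILE': None}

# Plain assignment binds a second name to the same dict object.
alias_backup = config
# A deep copy is an independent snapshot of the defaults.
snapshot_backup = copy.deepcopy(config)


def update_config(**kwargs):
    """Update the shared config dict in place."""
    config.update(**kwargs)


def reset_config():
    """Restore defaults in place so existing references see the reset."""
    config.clear()
    config.update(copy.deepcopy(snapshot_backup))


update_config(DEBUG=True)
print(alias_backup['DEBUG'])     # the alias changed together with config
print(snapshot_backup['DEBUG'])  # the snapshot kept the default
reset_config()
print(config['DEBUG'])
```

Resetting in place, rather than rebinding the module-level name, also keeps other modules that imported `config` earlier pointing at the same object.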
