Commit a729092

v6.0.0 Beta version testing
New version for testing. Some APIs are the same, but some have changed. Suitable documentation will be provided after internal testing.
1 parent 6eb2ad1 commit a729092

31 files changed (+2143 -1086 lines changed)

CODE_OF_CONDUCT.md

Lines changed: 1 addition & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1 @@
1+
Be humble or Be on your way.

CONTRIBUTING.md

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
1+
# Contribution Guidelines
2+
To get the greatest chance of helpful responses, please also observe the following notes.
3+
4+
## Questions
5+
6+
The GitHub issue tracker is for bug reports and feature requests.
7+
Please do not use it to ask questions about how to use the library.
8+
These questions should instead be directed to Stack Overflow.
9+
Make sure that your question is tagged with the python-pywebcopy tag when asking it on Stack Overflow,
10+
to ensure that it is answered promptly and accurately.
11+
12+
## Good Bug Reports
13+
14+
Please be aware of the following things when filing bug reports:
15+
- Avoid raising duplicate issues.
16+
- Please use the GitHub issue search feature to check whether your bug report or feature request has
17+
been mentioned in the past.
18+
- Duplicate bug reports and feature requests are a huge maintenance burden on the limited resources of the project.
19+
- If it is clear from your report that you would have struggled to find the original, that's ok, but if searching
20+
for a selection of words in your issue title would have found the duplicate then the issue will likely be closed
21+
extremely abruptly.
22+
- When filing bug reports about exceptions or tracebacks, please include the complete traceback.
23+
Partial tracebacks, or just the exception text, are not helpful.
24+
Issues that do not contain complete tracebacks may be closed without warning.
25+
26+
- Make sure you provide a suitable amount of information to work with. This means you should provide:
27+
- Guidance on how to reproduce the issue. Ideally, this should be a small code sample that
28+
can be run immediately by the maintainers.
29+
Failing that, let us know what you're doing, how often it happens,
30+
what environment you're using, etc. Be thorough: it prevents us needing to ask further questions.
31+
- Tell us what you expected to happen. When we run your example code, what are we expecting to happen? What does "success" look like for your code?
32+
- Tell us what actually happens. It's not helpful for you to say "it doesn't work" or "it fails".
33+
- Tell us how it fails: do you get an exception? A hang? How was the actual result different from your expected result?
34+
- Tell us what version of the library you're using, and how you installed it.
35+
Different versions of the libraries behave differently and have different bugs,
36+
and some distributors of the library ship patches on top of the code we supply.
37+
If you do not provide all of these things,
38+
it will take us much longer to fix your problem.
39+
If we ask you to clarify these and you never respond, we will close your issue without fixing it.

LICENSE

Lines changed: 10 additions & 16 deletions
@@ -1,19 +1,13 @@
1-
Copyright (c) 2018 The Python Packaging Authority
1+
Copyright 2019 Raja Tomar
22

3-
Permission is hereby granted, free of charge, to any person obtaining a copy
4-
of this software and associated documentation files (the "Software"), to deal
5-
in the Software without restriction, including without limitation the rights
6-
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
7-
copies of the Software, and to permit persons to whom the Software is
8-
furnished to do so, subject to the following conditions:
3+
Licensed under the Apache License, Version 2.0 (the "License");
4+
you may not use this file except in compliance with the License.
5+
You may obtain a copy of the License at
96

10-
The above copyright notice and this permission notice shall be included in all
11-
copies or substantial portions of the Software.
7+
http://www.apache.org/licenses/LICENSE-2.0
128

13-
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
14-
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
15-
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
16-
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
17-
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
18-
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
19-
SOFTWARE.
9+
Unless required by applicable law or agreed to in writing, software
10+
distributed under the License is distributed on an "AS IS" BASIS,
11+
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12+
See the License for the specific language governing permissions and
13+
limitations under the License.

README.md

Lines changed: 77 additions & 49 deletions
@@ -1,10 +1,11 @@
1-
# PyWebCopy © 5
1+
# PyWebCopy © 6
22

33
`Created By : Raja Tomar`
44
`License : MIT`
55
`Email: rajatomar788@gmail.com`
66

7-
Web Scraping and Saving Complete webpages and websites with python.
7+
Clone websites and webpages with ease in Python.
8+
Web Scraping or Saving Complete webpages and websites with python.
89

910
Web scraping and archiving tool written in Python
1011
Archive any online website and its assets, css, js and
@@ -14,9 +15,13 @@ It's easy with `pywebcopy`.
1415
Why it's great? because it -
1516

1617
- respects `robots.txt`
17-
- have a single-function basic usages
18+
- saves a webpage with css, js and images with one call
19+
- clones a complete website with assets and links remapped in one call
20+
- has direct APIs for simplicity and ease
21+
- supports subclassing for advanced usage
22+
- supports custom HTML tag handlers
1823
- lots of configuration for many custom needs
19-
- provides several scraping packages in one Objects (thanks to their original owners)
24+
- provides several scraping packages in one object (thanks to their original owners)
2025
- beautifulsoup4
2126
- lxml
2227
- requests
@@ -37,12 +42,12 @@ You are ready to go. Read the tutorials below to get started.
3742

3843
## First steps
3944

40-
You should always check if the pywebcopy is installed successfully.
45+
You should always check that the latest version of pywebcopy is installed successfully.
4146

4247
```python
4348
>>> import pywebcopy
4449
>>> pywebcopy.__version__
45-
5.x
50+
6.x
4651
```
4752

4853
Your version may be different. Now you can continue the tutorial.
@@ -54,10 +59,12 @@ To save any single page, just type in python console
5459
```Python
5560
from pywebcopy import save_webpage
5661

62+
kwargs = {'project_name': 'some-fancy-name'}
5763

5864
save_webpage(
5965
url='http://example-site.com/index.html',
60-
project_folder='path/to/downloads'
66+
project_folder='path/to/downloads',
67+
**kwargs
6168
)
6269
```
6370

@@ -66,15 +73,18 @@ To save full website (This could overload the target server, So, be careful)
6673
```Python
6774
from pywebcopy import save_website
6875

76+
kwargs = {'project_name': 'some-fancy-name'}
77+
6978
save_website(
7079
url='http://example-site.com/index.html',
7180
project_folder='path/to/downloads',
81+
**kwargs
7282
)
7383
```
7484

7585
### 1.2.1 Running Tests
7686
Running tests is simple and doesn't require any external library.
77-
Just run this command from root directory of pywebcopy package
87+
Just run this command from root directory of pywebcopy package.
7888

7989

8090
```shell
@@ -89,24 +99,24 @@ from pywebcopy import WebPage
8999
url = 'http://example-site.com/index.html' or None
90100
project_loc = 'path/to/downloads/folder'
91101

92-
wp = WebPage(url,
93-
project_folder
94-
default_encoding=None,
95-
HTML=None,
96-
**configKwargs
97-
)
102+
wp = WebPage()
98103

99104
# You can choose to load the page explicitly using
100105
# `requests` module
101106
wp.get(url, **requestsKwargs)
102107

108+
# OR
109+
# You can choose to set the source yourself
110+
handle = open('file.html', 'rb')
111+
wp.set_source(handle)
112+
103113
# if you want assets only
104114
wp.save_assets()
105115

106116
# if you want html only
107117
wp.save_html()
108118

109-
# if you want complete webpage
119+
# if you want complete webpage with css, js and images
110120
wp.save_complete()
111121
```
112122

@@ -171,6 +181,7 @@ then check if website allows scraping of its content.
171181
>>> pywebcopy.config['bypass_robots'] = True
172182

173183
# rest of your code follows..
184+
174185
```
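
For context on what the `bypass_robots` key overrides, here is a minimal sketch of a `robots.txt` check using only the standard library's `urllib.robotparser`. It is not pywebcopy's own implementation; the `is_allowed` helper and the sample rules are hypothetical and exist only for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "*", "http://example-site.com/index.html"))
print(is_allowed(SAMPLE_ROBOTS, "*", "http://example-site.com/private/data.html"))
```

A crawler that respects `robots.txt` would skip any URL for which such a check returns `False`.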
175186

176187
### Overwrite existing files when copying
@@ -183,6 +194,7 @@ use the over_write config key.
183194
>>> pywebcopy.config['over_write'] = True
184195

185196
# rest of your code follows..
197+
186198
```
187199

188200
### Changing your project name
@@ -196,6 +208,7 @@ below
196208
>>> pywebcopy.config['project_name'] = 'my_project'
197209

198210
# rest of your code follows..
211+
199212
```
200213

201214
## How to - Save Single Webpage
@@ -204,28 +217,42 @@ Particular webpage can be saved easily using the following methods.
204217

205218
Note: if you get `pywebcopy.exceptions.AccessError` when running any of this code, use the code provided in later sections.
206219

207-
### Method 1
220+
### Method 1 : via api - `save_webpage()`
208221

209222
A webpage can easily be saved using a built-in function called `.save_webpage()` which takes several
210223
arguments also.
211224

212225
```python
213-
>>> import pywebcopy
214-
>>> pywebcopy.save_webpage(project_url='http://google.com', project_folder='c://Saved_Webpages/',)
226+
>>> from pywebcopy import save_webpage
227+
>>> save_webpage(project_url='http://google.com', project_folder='c://Saved_Webpages/',)
215228

216-
# rest of your code follows..
217229
```
218230

219231
### Method 2
220232

221-
This use case is slightly more powerful as it can provide every functionallity of the WebPage
222-
data class.
233+
This use case is slightly more powerful as it provides every functionality of the WebPage class.
223234

224235
```python
225-
>>> from pywebcopy import Webpage
236+
>>> from pywebcopy import WebPage, config
237+
>>> url = 'http://some-url.com/some-page.html'
238+
239+
# Always start by setting up the config, or use the APIs instead
240+
>>> config.setup_config(url, project_folder, project_name, **kwargs)
241+
242+
# Create an instance of the WebPage object
243+
>>> wp = WebPage()
244+
245+
# If you want to use `requests` to fetch the page then
246+
>>> wp.get(url)
247+
248+
# Else if you want to use plain html or urllib then use
249+
>>> wp.set_source(object_which_have_a_read_method, encoding=encoding)
250+
>>> wp.url = url # you need to do this if you are using set_source()
226251

227-
>>> wp = WebPage('http://google.com', 'e://tests/', project_name='Google')
252+
# Then you can access several methods like
228253
>>> wp.save_complete()
254+
>>> wp.save_html()
255+
>>> wp.save_assets()
229256

230257
# This Webpage object contains every methods of the Webpage() class and thus
231258
# can be reused for later usages.
@@ -242,44 +269,50 @@ One feature is that the raw html is now also accepted.
242269

243270
```python
244271

245-
>>> from pywebcopy import Webpage
272+
>>> from pywebcopy import WebPage, config
246273

247274
>>> HTML = open('test.html').read()
248275

249276
>>> base_url = 'http://example.com' # used as a base for downloading imgs, css, js files.
250277
>>> project_folder = '/saved_pages/'
278+
>>> config.setup_config(base_url, project_folder)
251279

252-
>>> wp = WebPage(base_url, project_folder, HTML=HTML)
280+
>>> wp = WebPage()
281+
>>> wp.set_source(HTML)
282+
>>> wp.url = base_url
253283
>>> wp.save_webpage()
284+
254285
```
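
The README describes the argument to `set_source()` as an object with a `read` method. As a rough sketch of what that means, and assuming a bytes-oriented handle like the `open('file.html', 'rb')` example, raw markup held in a string can be wrapped in an in-memory stream. The `read_source` helper below is hypothetical and only shows what a consumer of such an object might do. It is not part of pywebcopy.

```python
import io

raw_html = "<html><body><h1>Hello</h1></body></html>"

# Wrap the markup in an in-memory binary stream. Like a file opened with
# mode 'rb', it exposes a read() method that returns bytes.
handle = io.BytesIO(raw_html.encode("utf-8"))


def read_source(source, encoding="utf-8"):
    """Hypothetical helper: read bytes from any object with read() and decode them."""
    return source.read().decode(encoding)


decoded = read_source(handle)
print(decoded == raw_html)
```

The same handle could come from `open(..., 'rb')` or from a `urllib` response, since both also provide `read()`.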
255286

256-
## How to - Whole Websites
287+
## How to - Clone Whole Websites
257288

258289
Use caution when copying websites as this can overload or damage the
259290
servers of the site and, in rare cases, could be illegal, so check everything before
260291
you proceed.
261292

262-
### Method 1 -
293+
### Method 1 : via api - `save_website()`
263294

264295
Using the inbuilt api `.save_website()` which takes several arguments.
265296

266297
```python
267-
>>> import pywebcopy
298+
>>> from pywebcopy import save_website
299+
300+
>>> save_website(project_url='http://localhost:8000', project_folder='e://tests/')
268301

269-
>>> pywebcopy.save_website(project_url='http://localhost:8000', project_folder='e://tests/')
270302
```
271303

272304
### Method 2 -
273305

274306
By creating a Crawler() object which provides several other functions as well.
275307

276308
```python
277-
>>> import pywebcopy
309+
>>> from pywebcopy import Crawler, config
278310

279-
>>> pywebcopy.config.setup_config(project_url='http://localhost:5000/', project_folder='e://tests/', project_name='LocalHost')
311+
>>> config.setup_config(project_url='http://localhost:5000/', project_folder='e://tests/', project_name='LocalHost')
280312

281-
>>> crawler = pywebcopy.Crawler('http://localhost:5000/')
313+
>>> crawler = Crawler('http://localhost:5000/')
282314
>>> crawler.crawl()
315+
283316
```
284317

285318
## Contribution
@@ -296,33 +329,36 @@ If you have any suggestions or fixes or reports feel free to mail me :)
296329

297330
`pywebcopy` is highly configurable.
298331

299-
### 1.3.1 Direct Call Method
332+
### 1.3.1 APIs
300333

301-
To change any configuration, just pass it to the `init` call.
334+
To change any configuration, just pass it to the `api` call.
302335

303336
Example:
304337

305338
```Python
306-
from pywebcopy.core import save_webpage
339+
from pywebcopy import save_webpage
340+
341+
kwargs = {
342+
'key1': 'value1',
343+
...
344+
}
307345

308346
save_webpage(
309347

310348
url='http://some-site.com/', # required
311349
download_loc='path/to/downloads/', # required
312350

313-
# config keys are case-insensitive
314-
any_config_key='new_value',
315-
another_config_key='another_new_value',
351+
**kwargs
316352

317353
...
318354

319355
# add as many as you want :)
320356
)
357+
321358
```
322359

323360
### 1.3.2 `config.setup_config` Method
324361

325-
>**This function is changed from `core.setup_config`**
326362

327363
You can manually configure every configuration by using a
328364
`config.setup_config` call.
@@ -378,12 +414,6 @@ below is the list of `config` keys with their `default` values :
378414
# delete the project folder after making zip archive of it
379415
'delete_project_folder': False
380416

381-
# which parser to use when parsing pages
382-
# for speed choose 'html.parser' (will crack some webpages)
383-
# for exact webpage copy choose 'html5lib' (a little slow)
384-
# or you can leave it to default 'lxml' (balanced)
385-
'PARSER' : 'lxml'
386-
387417
# to download css file or not
388418
'LOAD_CSS': True
389419

@@ -398,10 +428,7 @@ below is the list of `config` keys with their `default` values :
398428
'OVER_WRITE': False
399429

400430
# list of allowed file extensions
401-
'ALLOWED_FILE_EXT': ['.html', '.css', '.json', '.js',
402-
'.xml','.svg', '.gif', '.ico',
403-
'.jpeg', '.jpg', '.png', '.ttf',
404-
'.eot', '.otf', '.woff']
431+
'ALLOWED_FILE_EXT': ['.html', '.css', ...]
405432

406433
# log file path
407434
'LOG_FILE': None
@@ -425,6 +452,7 @@ below is the list of `config` keys with their `default` values :
425452

426453
# bypass the robots.txt restrictions
427454
'BYPASS_ROBOTS' : False
455+
428456
```
429457

430458
Told you there were plenty of `config` vars available!
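
Of the keys listed above, `ALLOWED_FILE_EXT` is an allow-list of file extensions. A minimal sketch of how such a list might be applied to asset URLs follows. This is an assumption about the general technique, not pywebcopy's actual code; the `has_allowed_extension` helper and the subset of extensions are illustrative only.

```python
import posixpath
from urllib.parse import urlparse

# Illustrative subset; the real default list is longer.
ALLOWED_FILE_EXT = ('.html', '.css', '.js', '.png')


def has_allowed_extension(url, allowed=ALLOWED_FILE_EXT):
    """Return True if the URL's path ends with one of the allowed extensions."""
    # Look only at the path, so query strings and fragments are ignored.
    path = urlparse(url).path
    ext = posixpath.splitext(path)[1].lower()
    return ext in allowed


print(has_allowed_extension('http://example-site.com/static/site.CSS?v=2'))
print(has_allowed_extension('http://example-site.com/archive.zip'))
```

Lowercasing the extension makes the comparison case-insensitive, so `site.CSS` and `site.css` are treated alike.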
