
Commit e425742

Beta (#11)
### [version 6.0.0]
- **Breaking Change** New command-line interface using the `Python Fire` library.
- Implemented type checks and path normalising in `config.setup_paths`.
- Added new dynamic `pywebcopy.__all__` attribute generation.
- `WebPage` class no longer takes any arguments **(breaking change)**.
- `WebPage` class has new methods `WebPage.get` and `WebPage.set_source`.
- Queuing of downloads is replaced with a barrier to manage active threads.
1 parent: 9c6b7e1

31 files changed: +2238 −1299 lines
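The changelog entry about type checks and path normalising in `config.setup_paths` suggests the general pattern sketched below. This is a hypothetical illustration (function body and error messages are assumptions), not pywebcopy's actual implementation:

```python
import os
import os.path

def setup_paths(project_folder, project_name):
    """Hypothetical sketch of type checking plus path normalisation."""
    if not isinstance(project_folder, str):
        raise TypeError("project_folder must be a string")
    if not isinstance(project_name, str):
        raise TypeError("project_name must be a string")
    # abspath anchors relative input; normpath collapses '..' segments
    # and mixed separators into a canonical form.
    normalised = os.path.normpath(os.path.abspath(project_folder))
    return os.path.join(normalised, project_name)
```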

.gitignore

Lines changed: 0 additions & 1 deletion
````diff
@@ -9,4 +9,3 @@
 Pipfile
 Pipfile.lock
 /testing/
-
````

README.md

Lines changed: 84 additions & 12 deletions
````diff
@@ -22,11 +22,11 @@ Why it's great? because it -
 - custom html tags handler support
 - lots of configuration for many custom needs
 - provides several scraping packages in one objects for scraping under one class
-    - beautifulsoup4
     - lxml
     - requests
-    - requests_html
+    - beautifulsoup4
     - pyquery
+    - requests_html
 
 Email me at `rajatomar788@gmail.com` of any query :)
````

````diff
@@ -94,9 +94,59 @@ Just run this command from root directory of pywebcopy package.
 
 
 ```shell
-$ python -m unittest pywebcopy.tests
+$ python -m pywebcopy run-tests
+```
+
+### 1.4 Command Line Interface
+`pywebcopy` have a very easy to use command-line interface which
+can help you do task without having to worrying about the inner
+long way.
+- #### Getting list of commands
+```shell
+$ python -m pywebcopy -- --help
+```
+- #### Using apis
+```shell
+$ python -m pywebcopy save_webpage http://google.com E://store// --bypass_robots=True
+or
+$ python -m pywebcopy save_website http://google.com E://store// --bypass_robots
+```
+- #### Running tests
+```shell
+$ python -m pywebcopy run_tests
+```
+
+
+### 1.5 Authentication and Cookies
+Most of the time authentication is needed to access a certain page.
+Its real easy to authenticate with `pywebcopy` because it usage an
+`requests.Session` object for base http activity which can be accessed
+through `pywebcopy.SESSION` attribute. And as you know there
+are ton of tutorials on setting up authentication with `requests.Session`.
+
+Here is a basic example of simple http auth -
+```python
+import pywebcopy
+
+# Update the headers with suitable data
+
+pywebcopy.SESSION.headers.update({
+    'auth': {'username': 'password'},
+    'form': {'key1': 'value1'},
+})
+
+# Rest of the code is as usual
+kwargs = {
+    'url': 'http://localhost:5000',
+    'project_folder': 'e://saved_pages//',
+    'project_name': 'my_site'
+}
+pywebcopy.config.setup_config(**kwargs)
+pywebcopy.save_webpage(**kwargs)
+
 ```
 
+
 ### 2.1 `WebPage` class
 
 `WebPage` class, the engine of this saving actions.
````
````diff
@@ -227,7 +277,7 @@ through any method described above
 Multiple scraping packages are wrapped up in one object
 which you can use to unlock the best of all those libraries
 at one go without having to go through the hassle of
-instanciating each one of those libraries
+instantiating each one of those libraries
 
 > To use all the methods and properties documented below
 > just create a object once as described
````
````diff
@@ -303,9 +353,28 @@ wp = MultiParser(html, encoding)
 >>> [<Element 'a' href='http://kennethreitz.com/pages'>, ...]
 ```
 
-## `Crawler` class in `pywebcopy`
-Class on which website cloning depends upon.
+## `Crawler` object
+This is a subclass of `WebPage` class and can be used to mirror any website.
+
+```python
+>>> from pywebcopy import Crawler, config
+>>> url = 'http://some-url.com/some-page.html'
+>>> project_folder = '/home/desktop/'
+>>> project_name = 'my_project'
+>>> kwargs = {'bypass_robots': True}
+# You should always start with setting up the config or use apis
+>>> config.setup_config(url, project_folder, project_name, **kwargs)
 
+# Create a instance of the webpage object
+>>> wp = Crawler()
+
+# If you want to you can use `requests` to fetch the pages
+>>> wp.get(url, **{'auth': ('username', 'password')})
+
+# Then you can access several methods like
+>>> wp.crawl()
+
+```
 
 
 ## Common Settings and Errors
````
````diff
@@ -384,7 +453,7 @@ This use case is slightly more powerful as it can provide every functionallity o
 >>> config.setup_config(url, project_folder, project_name, **kwargs)
 
 # Create a instance of the webpage object
->>> wp = Webpage()
+>>> wp = WebPage()
 
 # If you want to use `requests` to fetch the page then
 >>> wp.get(url)
````
````diff
@@ -450,9 +519,10 @@ By creating a Crawler() object which provides several other functions as well.
 ```python
 >>> from pywebcopy import Crawler, config
 
->>> config.setup_config(project_url='http://localhost:5000/', project_folder='e://tests/', project_name='LocalHost')
+>>> config.setup_config(project_url='http://localhost:5000/',
+                        project_folder='e://tests/', project_name='LocalHost')
 
->>> crawler = Crawler('http://localhost:5000/')
+>>> crawler = Crawler()
 >>> crawler.crawl()
 
 ```
````
````diff
@@ -601,8 +671,10 @@ then you can always create and pull request or email me.
 ## 6.1 Changelog
 
 ### [version 6.0.0]
-
-- `WebPage` class now doesn't take any argument **(breaking change)**
+- **Breaking Change** New command-line interface using `Python Fire` library.
+- Implemented type checks and path normalising in the `config.setup_paths`.
+- added new dynamic `pywebcopy.__all__` attr generation.
+- `WebPage` class now doesnt take any argument **(breaking change)**
 - `WebPage` class has new methods `WebPage.get` and `WebPage.set_source`
 - Queuing of downloads is replaced with a barrier to manage active threads
 
````
````diff
@@ -614,7 +686,7 @@ then you can always create and pull request or email me.
 
 ### [version 4.x]
 
-- *A complete rewrite and restructing of core functionality.*
+- *A complete rewrite and restructuring of core functionality.*
 
 ### [version 2.0.0]
 
````
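The changelog line about replacing download queuing with a barrier on active threads can be sketched with a counting semaphore. This is an illustration of the technique, not pywebcopy's actual code; names like `MAX_ACTIVE` and `download` are assumptions:

```python
import threading

MAX_ACTIVE = 4                       # cap on simultaneously active downloads
barrier = threading.Semaphore(MAX_ACTIVE)
results = []
lock = threading.Lock()

def download(url):
    with barrier:                    # at most MAX_ACTIVE threads pass at once
        with lock:                   # protect the shared result list
            results.append(url)      # stand-in for the real fetch-and-save

threads = [threading.Thread(target=download, args=('page-%d' % i,))
           for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Unlike a work queue, the semaphore lets callers start as many threads as they like while still bounding how many do real work concurrently.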

docs/index.md

Lines changed: 51 additions & 1 deletion
````diff
@@ -93,6 +93,56 @@ Just run this command from root directory of pywebcopy package.
 
 
 ```shell
-$ python -m unittest pywebcopy.tests
+$ python -m unittest tests
 ```
 
+
+
+### Command Line Interface
+`pywebcopy` have a very easy to use command-line interface which
+can help you do task without having to worrying about the inner
+long way.
+- #### Getting list of commands
+```shell
+$ python -m pywebcopy -- --help
+```
+- #### Using apis
+```shell
+$ python -m pywebcopy save_webpage http://google.com E://store// --bypass_robots=True
+or
+$ python -m pywebcopy save_website http://google.com E://store// --bypass_robots
+```
+- #### Running tests
+```shell
+$ python -m pywebcopy run_tests
+```
+
+
+### Authentication and Cookies
+Most of the time authentication is needed to access a certain page.
+Its real easy to authenticate with `pywebcopy` because it usage an
+`requests.Session` object for base http activity which can be accessed
+through `pywebcopy.SESSION` attribute. And as you know there
+are ton of tutorials on setting up authentication with `requests.Session`.
+
+Here is a basic example of simple http auth -
+```python
+import pywebcopy
+
+# Update the headers with suitable data
+
+pywebcopy.SESSION.headers.update({
+    'auth': {'username': 'password'},
+    'form': {'key1': 'value1'},
+})
+
+# Rest of the code is as usual
+kwargs = {
+    'url': 'http://localhost:5000',
+    'project_folder': 'e://saved_pages//',
+    'project_name': 'my_site'
+}
+pywebcopy.config.setup_config(**kwargs)
+pywebcopy.save_webpage(**kwargs)
+
+```
````
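For reference, with a plain `requests.Session` (which the docs above say backs `pywebcopy.SESSION`), basic auth, default headers, and cookies are usually attached to the session itself. This is a generic `requests` sketch, independent of pywebcopy; the credentials and cookie values are placeholders:

```python
import requests

session = requests.Session()
session.auth = ('username', 'password')   # HTTP basic auth on every request
session.headers.update({'User-Agent': 'pywebcopy-demo'})
session.cookies.set('sessionid', 'abc123')

# Session-level settings are merged into every request prepared through it.
prepared = session.prepare_request(
    requests.Request('GET', 'http://localhost:5000/'))
```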

examples.py

Lines changed: 1 addition & 0 deletions
````diff
@@ -35,6 +35,7 @@
 # page_url = 'https://www.w3schools.com/'
 # page_url = 'https://test-domain.com/'
 page_url = 'http://localhost:5000'
+
 handle = open(os.path.join(os.getcwd(), 'tests', 'test.html'), 'rb')
 # page_url = 'https://getbootstrap.com/'
 
````

mkdocs.yml

Lines changed: 1 addition & 0 deletions
````diff
@@ -1,5 +1,6 @@
 site_name: PyWebcopy
 theme: readthedocs
+
 nav:
 - Home: index.md
 - How-To: how-tos.md
````

pywebcopy/logger.py renamed to obsolute/_logging.py

Lines changed: 4 additions & 4 deletions
````diff
@@ -8,11 +8,11 @@
 - HTMLLogger instance with name, level, title, mode, version etc.
 - call log, debug, info etc. on the instance
 """
-
+from __future__ import absolute_import
 import time
 import logging
 
-from .globals import VERSION
+from . import __version__
 
 
 #: HTML header starts the document
@@ -189,7 +189,7 @@ def action(self, message, *args, **kws):
 logging.Logger.action = action
 
 
-def new_html_logger(title="PywebCopy Log", version=VERSION, filename='log.html', mode='w'):
+def new_html_logger(title="PywebCopy Log", version=__version__, filename='log.html', mode='w'):
     """Creates a new html file logging handler for use in logger.
 
     :rtype: HTMLFileHandler
@@ -214,7 +214,7 @@ def new_console_logger(level=logging.WARNING):
     """
     c_logger = logging.StreamHandler()
     c_logger.setLevel(level)
-    c_logger.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))
+    c_logger.setFormatter(logging.Formatter("%(levelname)-8s - %(message)s"))
     return c_logger
 
 
````
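The `HTMLFileHandler` this module returns can be approximated with a small `logging.FileHandler` subclass that wraps each record in markup. This is a minimal sketch only (class name and markup are assumptions); the handler in `_logging.py` is more elaborate:

```python
import logging
import os
import tempfile

class SimpleHTMLFileHandler(logging.FileHandler):
    """Write each log record as an HTML paragraph."""
    def emit(self, record):
        # Wrap the fully rendered message, then clear args so the
        # base handler does not re-apply %-formatting.
        record.msg = '<p class="%s">%s</p>' % (
            record.levelname.lower(), record.getMessage())
        record.args = ()
        logging.FileHandler.emit(self, record)

path = os.path.join(tempfile.gettempdir(), 'pywebcopy_demo_log.html')
logger = logging.getLogger('html_demo')
logger.addHandler(SimpleHTMLFileHandler(path, mode='w'))
logger.warning('disk %s is full', 'C:')
```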

obsolute/core.py

Whitespace-only changes.

obsolute/utils.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -9,13 +9,13 @@
99

1010
import os
1111
import re
12-
12+
import logging
1313
from six.moves.urllib.parse import urljoin, urlsplit, urlparse
1414
from six.moves.urllib.request import pathname2url, url2pathname
1515

16-
from pywebcopy import LOGGER
1716
from configs import config
1817

18+
LOGGER = logging.getLogger('utils')
1919
DEBUG = config['DEBUG']
2020

2121

pywebcopy/__init__.py

Lines changed: 36 additions & 38 deletions
````diff
@@ -1,59 +1,57 @@
 # -*- coding: utf-8 -*-
+#
+# Copyright 2019 Raja Tomar
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
 """
+
 pywebcopy
 ~~~~~~~~~
 
-Python library to clone complete webpages and websites.
-
-
-Copyright 2019 Raja Tomar
+Python library to clone web-pages and websites with all its peripheral files.
 
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
+.. version changed :: 6.0.0
+1. **Breaking Change** New command-line interface using `Python Fire` library.
+2. Implemented type checks and path normalising in the `config.setup_paths`.
 
 """
-
-
 __author__ = 'Raja Tomar'
 __email__ = 'rajatomar788@gmail.com'
 __license__ = 'Apache License 2.0'
-__version__ = (6, 0, 0, 'rc', 1)
+__version__ = '6.0.0'
 
+import logging
 
-from .globals import *
-from .logger import LOGGER  # Global Logger instance
 from .configs import config, SESSION
-from .urls import URLTransformer, filename_present
-from .elements import LinkTag, ScriptTag, ImgTag, AnchorTag, TagBase
+from .parsers import Parser, MultiParser
 from .webpage import WebPage
-from .parsers import MultiParser
-from .core import get, new_file
 from .crawler import Crawler
 from .api import save_website, save_webpage
 
-
 __all__ = [
-    'save_webpage', 'save_website',  #: apis
-    'config',  #: configuration
-    'WebPage', 'Crawler', 'MultiParser',  #: Classes
-    'SESSION',  #: Http Session
-    'URLTransformer', 'filename_present',  #: Url manipulation
-    'TagBase', 'LinkTag', 'ScriptTag', 'ImgTag', 'AnchorTag',  #: Customisable tag handling
-    'get', 'new_file',  #: some goodies
+    'WebPage', 'Crawler',
+    'save_webpage', 'save_website',
+    'config', 'SESSION',
+    'Parser', 'MultiParser',
 ]
 
-#: alias
-Webpage = WebPage
-
-
-def __dir__():
-    return __all__ + (__version__, __author__, __email__, __license__, Webpage)
-
-
+#: optimisations
+logging.logThreads = 0
+logging.logProcesses = 0
+logging._srcfile = None
+c_handler = logging.StreamHandler()
+logging.basicConfig(
+    level=logging.DEBUG,
+    handlers=[c_handler],
+    format='%(name)-10s - %(levelname)-8s - %(message)s'
+)
+c_handler.setLevel(logging.INFO)
````
