@@ -22,11 +22,11 @@ Why it's great? Because it -
- custom html tags handler support
- lots of configuration for many custom needs
- provides several scraping packages in one object for scraping under one class
-   - beautifulsoup4
  - lxml
  - requests
-   - requests_html
+   - beautifulsoup4
  - pyquery
+   - requests_html

Email me at `rajatomar788@gmail.com` for any query :)
@@ -94,9 +94,59 @@ Just run this command from the root directory of the pywebcopy package.
``` shell
- $ python -m unittest pywebcopy.tests
+ $ python -m pywebcopy run-tests
+ ```
+
+ ### 1.4 Command Line Interface
+ `pywebcopy` has a very easy to use command-line interface which
+ can help you get things done without having to worry about the
+ longer programmatic way (sketched after this list for comparison).
+ - #### Getting the list of commands
+ ``` shell
+ $ python -m pywebcopy -- --help
+ ```
+ - #### Using the apis
+ ``` shell
+ $ python -m pywebcopy save_webpage http://google.com E://store// --bypass_robots=True
+ or
+ $ python -m pywebcopy save_website http://google.com E://store// --bypass_robots
+ ```
+ - #### Running tests
+ ``` shell
+ $ python -m pywebcopy run_tests
+ ```
+
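For comparison, here is what the `save_webpage` command above corresponds to in
library code, using the same calls shown elsewhere in this document. A minimal
sketch assuming the same arguments as the CLI example (the `project_name` value
is only an illustrative placeholder):

``` python
import pywebcopy

# programmatic equivalent of:
#   python -m pywebcopy save_webpage http://google.com E://store// --bypass_robots=True
kwargs = {
    'url': 'http://google.com',
    'project_folder': 'E://store//',
    'project_name': 'google',  # placeholder name; the CLI example does not set one
    'bypass_robots': True,
}
pywebcopy.config.setup_config(**kwargs)
pywebcopy.save_webpage(**kwargs)
```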
+ ### 1.5 Authentication and Cookies
+ Most of the time authentication is needed to access a certain page.
+ It's really easy to authenticate with `pywebcopy` because it uses a
+ `requests.Session` object for the base http activity, which can be accessed
+ through the `pywebcopy.SESSION` attribute. And as you know, there
+ are tons of tutorials on setting up authentication with `requests.Session`.
+
+ Here is a basic example of simple http auth -
+ ``` python
+ import pywebcopy
+
+ # `pywebcopy.SESSION` is a `requests.Session` object,
+ # so http basic auth can be set on it directly
+ pywebcopy.SESSION.auth = ('username', 'password')
+
+ # any extra headers go on the session in the usual way
+ pywebcopy.SESSION.headers.update({'User-Agent': 'Mozilla/5.0'})
+
+ # Rest of the code is as usual
+ kwargs = {
+     'url': 'http://localhost:5000',
+     'project_folder': 'e://saved_pages//',
+     'project_name': 'my_site'
+ }
+ pywebcopy.config.setup_config(**kwargs)
+ pywebcopy.save_webpage(**kwargs)
+
```

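The section title also covers cookies: because `pywebcopy.SESSION` is a plain
`requests.Session`, cookies obtained from an earlier login can be attached to it
in the same way before saving a page. A minimal sketch, where the cookie name and
value are only placeholders:

``` python
import pywebcopy

# attach a cookie from an earlier login to the shared session;
# 'sessionid' and its value are placeholders for your site's cookie
pywebcopy.SESSION.cookies.update({'sessionid': 'your-session-cookie-value'})

# then continue with config.setup_config() and save_webpage() as shown above
```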
### 2.1 `WebPage` class

`WebPage` class, the engine of these saving actions.
@@ -227,7 +277,7 @@ through any method described above
Multiple scraping packages are wrapped up in one object
which you can use to unlock the best of all those libraries
at one go without having to go through the hassle of
- instanciating each one of those libraries
+ instantiating each one of those libraries

> To use all the methods and properties documented below
> just create an object once as described
@@ -303,9 +353,28 @@ wp = MultiParser(html, encoding)
>>> [<Element 'a' href='http://kennethreitz.com/pages'>, ...]
```

- ## `Crawler` class in `pywebcopy`
- Class on which website cloning depends upon.
+ ## `Crawler` object
+ This is a subclass of the `WebPage` class and can be used to mirror any website.
+
+ ``` python
+ >>> from pywebcopy import Crawler, config
+ >>> url = 'http://some-url.com/some-page.html'
+ >>> project_folder = '/home/desktop/'
+ >>> project_name = 'my_project'
+ >>> kwargs = {'bypass_robots': True}
+ # You should always start with setting up the config or use apis
+ >>> config.setup_config(url, project_folder, project_name, **kwargs)

+ # Create an instance of the crawler object
+ >>> wp = Crawler()
+
+ # If you want, you can use `requests` to fetch the page
+ >>> wp.get(url, **{'auth': ('username', 'password')})
+
+ # Then you can access several methods like
+ >>> wp.crawl()
+
+ ```

## Common Settings and Errors
@@ -384,7 +453,7 @@ This use case is slightly more powerful as it can provide every functionality of
>>> config.setup_config(url, project_folder, project_name, **kwargs)

# Create an instance of the webpage object
- >>> wp = Webpage()
+ >>> wp = WebPage()

# If you want to use `requests` to fetch the page then
>>> wp.get(url)
@@ -450,9 +519,10 @@ By creating a Crawler() object which provides several other functions as well.
``` python
>>> from pywebcopy import Crawler, config

- >>> config.setup_config(project_url='http://localhost:5000/', project_folder='e://tests/', project_name='LocalHost')
+ >>> config.setup_config(project_url='http://localhost:5000/',
+                         project_folder='e://tests/', project_name='LocalHost')

- >>> crawler = Crawler('http://localhost:5000/')
+ >>> crawler = Crawler()
>>> crawler.crawl()

```
@@ -601,8 +671,10 @@ then you can always create a pull request or email me.
## 6.1 Changelog

### [version 6.0.0]
-
- - `WebPage` class now doesn't take any argument **(breaking change)**
+ - **Breaking Change** New command-line interface using the `Python Fire` library.
+ - Implemented type checks and path normalising in `config.setup_paths`.
+ - Added new dynamic `pywebcopy.__all__` attribute generation.
+ - `WebPage` class now doesn't take any argument **(breaking change)**
- `WebPage` class has new methods `WebPage.get` and `WebPage.set_source`
- Queuing of downloads is replaced with a barrier to manage active threads
@@ -614,7 +686,7 @@ then you can always create a pull request or email me.
### [version 4.x]

- - *A complete rewrite and restructing of core functionality.*
+ - *A complete rewrite and restructuring of core functionality.*

### [version 2.0.0]