Commit e73aa71

Merge pull request #17 from celebi-pkg/dev
v1.2.0 Complete overhaul of scraping setup
2 parents c92ce2a + 3583bde commit e73aa71

6 files changed: +451 -127 lines changed

README.md

Lines changed: 42 additions & 12 deletions
Original file line number | Diff line number | Diff line change
@@ -1,17 +1,20 @@
11
[![kcelebi](https://circleci.com/gh/celebi-pkg/flight-analysis.svg?style=svg)](https://circleci.com/gh/celebi-pkg/flight-analysis)
22
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
3-
[![Live on PyPI](https://img.shields.io/badge/PyPI-1.1.0-brightgreen)](https://pypi.org/project/google-flight-analysis/)
3+
[![Live on PyPI](https://img.shields.io/badge/PyPI-1.2.0-brightgreen)](https://pypi.org/project/google-flight-analysis/)
4+
[![TestPyPI](https://img.shields.io/badge/PyPI-1.1.1--alpha.11-blue)](https://test.pypi.org/project/google-flight-analysis/1.1.1a11/)
45

56
# Flight Analysis
67

7-
This project provides tools and models for users to analyze, forecast, and collect data regarding flights and prices. There are currently many features in initial stages and in development. The current features (as of 4/5/2023) are:
8+
This project provides tools and models for users to analyze, forecast, and collect data regarding flights and prices. There are currently many features in initial stages and in development. The current features (as of 5/25/2023) are:
89

9-
- Scraping tools for Google Flights
10+
- Detailed scraping and querying tools for Google Flights
11+
- Ability to store data locally or to SQL tables
1012
- Base analytical tools/methods for price forecasting/summary
1113

1214
The features in development are:
1315

1416
- Models to demonstrate ML techniques on forecasting
17+
- Querying of advanced features
1518
- API for access to previously collected data
1619

1720
## Table of Contents
@@ -59,19 +62,46 @@ For GitHub repository cloners, import as follows from the root of the repository
5962

6063
Here is some quick starter code to accomplish the basic tasks. Find more in the [documentation](https://kcelebi.github.io/flight-analysis/).
6164

62-
# Try to keep the dates in format YYYY-mm-dd
63-
result = Scrape('JFK', 'IST', '2023-07-20', '2023-08-10') # obtain our scrape object
64-
dataframe = result.data # outputs a Pandas DF with flight prices/info
65-
origin = result.origin # 'JFK'
66-
dest = result.dest # 'IST'
67-
date_leave = result.date_leave # '2023-07-20'
68-
date_return = result.date_return # '2023-08-10'
65+
# Keep the dates in format YYYY-mm-dd
66+
result = Scrape('JFK', 'IST', '2023-07-20', '2023-08-20') # obtain our scrape object, represents our query
67+
result.type # This is in a round-trip format
68+
result.origin # ['JFK', 'IST']
69+
result.dest # ['IST', 'JFK']
70+
result.dates # ['2023-07-20', '2023-08-20']
71+
print(result) # get unqueried str representation
6972

70-
You can also scrape for one-way trips now:
73+
A `Scrape` object represents a Google Flights query to be run. It maintains flights as a sequence of one or more one-way flights, each with an origin, a destination, and a flight date. The above object for a round-trip flight from JFK to IST is a sequence of JFK --> IST, then IST --> JFK. We can obtain the data as follows:
74+
75+
ScrapeObjects(result) # runs selenium through ChromeDriver, modifies results in-place
76+
result.data # returns pandas DF
77+
print(result) # get queried representation of result
78+
79+
You can also scrape for one-way trips:
7180

7281
result = Scrape('JFK', 'IST', '2023-08-20')
73-
result.data.head() #see data
82+
ScrapeObjects(result)
83+
result.data # see data
84+
85+
You can also scrape chain-trips, which are defined as a sequence of one-way flights that have no direct relation to each other, other than being in chronological order.
86+
87+
# chain-trip format: origin, dest, date, origin, dest, date, ...
88+
result = Scrape('JFK', 'IST', '2023-08-20', 'RDU', 'LGA', '2023-12-25', 'EWR', 'SFO', '2024-01-20')
89+
result.type # chain-trip
90+
ScrapeObjects(result)
91+
result.data # see data
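
As a rough illustration of the chain-trip argument format above, the flat argument list can be grouped into (origin, dest, date) triples and checked for chronological order. This is only a sketch: `parse_chain_trip` is a hypothetical helper, not part of the `google_flight_analysis` API, and the library's own parsing may differ.

```python
from datetime import datetime


def parse_chain_trip(*args):
    """Group flat chain-trip arguments into (origin, dest, date) legs.

    Hypothetical helper for illustration; not the library's implementation.
    """
    if len(args) == 0 or len(args) % 3 != 0:
        raise ValueError("expected origin, dest, date triples")

    # Slice the flat argument list into consecutive triples.
    legs = [tuple(args[i:i + 3]) for i in range(0, len(args), 3)]

    # Chain-trips are defined as being in chronological order.
    dates = [datetime.strptime(date, "%Y-%m-%d") for _, _, date in legs]
    if dates != sorted(dates):
        raise ValueError("flights must be in chronological order")
    return legs


legs = parse_chain_trip(
    'JFK', 'IST', '2023-08-20',
    'RDU', 'LGA', '2023-12-25',
    'EWR', 'SFO', '2024-01-20',
)
for origin, dest, date in legs:
    print(f"{origin} --> {dest} on {date}")
```

Each leg is independent of the others apart from ordering, which matches the chain-trip definition given above.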
92+
93+
You can also scrape perfect-chains, which are defined as a sequence of one-way flights such that the destination of the previous flight is the origin of the next and the origin of the chain is the final destination of the chain (a cycle).
94+
95+
# perfect-chain format: origin, date, origin, date, ..., first_origin
96+
result = Scrape("JFK", "2023-09-20", "IST", "2023-09-25", "CDG", "2023-10-10", "LHR", "2023-11-01", "JFK")
97+
result.type # perfect-chain
98+
ScrapeObjects(result)
99+
result.data # see data
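
To make the perfect-chain definition concrete, the alternating origin/date arguments can be expanded into one-way legs in which each destination is the next origin. This is a minimal sketch under that reading of the format: `expand_perfect_chain` is a hypothetical name and does not come from the package itself.

```python
def expand_perfect_chain(*args):
    """Expand 'origin, date, ..., first_origin' into one-way legs.

    Hypothetical helper for illustration; not the library's implementation.
    """
    if len(args) < 3 or len(args) % 2 == 0:
        raise ValueError("expected origin, date, ..., first_origin")

    airports = args[0::2]  # every other argument, starting with the first origin
    dates = args[1::2]     # the dates sitting between the airports

    # A perfect-chain is a cycle: it must end where it started.
    if airports[0] != airports[-1]:
        raise ValueError("a perfect-chain must end at its first origin")

    # Leg i departs airports[i] on dates[i] and arrives at airports[i + 1].
    return [(airports[i], airports[i + 1], dates[i]) for i in range(len(dates))]


legs = expand_perfect_chain(
    "JFK", "2023-09-20", "IST", "2023-09-25",
    "CDG", "2023-10-10", "LHR", "2023-11-01", "JFK",
)
for origin, dest, date in legs:
    print(f"{origin} --> {dest} on {date}")
```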
100+
101+
You can read more about the different types of trips in the documentation. Scrape objects can be added to one another to create larger queries. Addition is allowed only when both of the following conditions hold:
74102

103+
1. The objects being added are the same type of trip (one-way, round-trip, etc.)
104+
2. The objects being added are either both unqueried or both queried
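
The two conditions above can be sketched with a small stand-in class. `Query`, its fields, and the exception type are assumptions made for illustration; the real `Scrape` class may store its state and report errors differently.

```python
from dataclasses import dataclass


@dataclass
class Query:
    """Hypothetical stand-in for a Scrape object; not the library's class."""
    trip_type: str          # e.g. 'one-way', 'round-trip', 'chain-trip'
    legs: list              # (origin, dest, date) tuples
    queried: bool = False   # True once data has been fetched

    def __add__(self, other):
        if not isinstance(other, Query):
            return NotImplemented
        # Condition 1: both objects must be the same type of trip.
        if self.trip_type != other.trip_type:
            raise TypeError("can only add queries of the same trip type")
        # Condition 2: both must be unqueried, or both must be queried.
        if self.queried != other.queried:
            raise TypeError("queries must be both queried or both unqueried")
        return Query(self.trip_type, self.legs + other.legs, self.queried)


a = Query('one-way', [('JFK', 'IST', '2023-08-20')])
b = Query('one-way', [('RDU', 'LGA', '2023-12-25')])
combined = a + b
print(combined.legs)
```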
75105

76106
## Updates & New Features
77107

requirements.txt

Lines changed: 4 additions & 3 deletions
@@ -1,6 +1,7 @@
11
tqdm
22
numpy
3-
pandas==2.0.1
3+
pandas
44
selenium
5-
pytest==7.2.2
6-
sqlalchemy
5+
pytest
6+
sqlalchemy
7+
chromedriver-autoinstaller

setup.cfg

Lines changed: 2 additions & 0 deletions
@@ -24,6 +24,8 @@ install_requires =
2424
numpy
2525
pandas
2626
selenium
27+
sqlalchemy
28+
chromedriver-autoinstaller
2729

2830
[options.packages.find]
2931
where = src

src/google_flight_analysis/flight.py

Lines changed: 4 additions & 4 deletions
@@ -8,12 +8,12 @@
88

99
class Flight:
1010

11-
def __init__(self, dl, *args):
11+
def __init__(self, date, *args):
1212
self._id = 1
1313
self._origin = None
1414
self._dest = None
15-
self._date = dl
16-
self._dow = datetime.strptime(dl, '%Y-%m-%d').isoweekday() # day of week
15+
self._date = date
16+
self._dow = datetime.strptime(date, '%Y-%m-%d').isoweekday() # day of week
1717
self._airline = None
1818
self._flight_time = None
1919
self._num_stops = None
@@ -105,7 +105,7 @@ def time_arrive(self):
105105
return self._time_arrive
106106

107107
def _classify_arg(self, arg):
108-
if ('AM' in arg or 'PM' in arg) and len(self._times) < 2:
108+
if ('AM' in arg or 'PM' in arg) and len(self._times) < 2 and ':' in arg:
109109
# arrival or departure time
110110
delta = timedelta(days = 0)
111111
if arg[-2] == '+':
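
The `':' in arg` condition added to `_classify_arg` presumably guards against strings that merely contain the letters `AM` or `PM`. The sketch below reproduces only the substring checks from the changed line and omits the `len(self._times) < 2` part; the example strings are assumptions, not captured scraper output.

```python
def old_check(arg: str) -> bool:
    """Substring test as it was before this commit (simplified)."""
    return 'AM' in arg or 'PM' in arg


def new_check(arg: str) -> bool:
    """Substring test after this commit: a time must also contain a colon."""
    return ('AM' in arg or 'PM' in arg) and ':' in arg


# Assumed example inputs: two time strings and two non-time strings
# that happen to contain the letters 'AM'.
examples = ['10:35 AM', '1:05 PM+1', 'LATAM', 'AMS']
for text in examples:
    print(f"{text!r}: old={old_check(text)}, new={new_check(text)}")
```

With the old check, an airline name such as `LATAM` or an airport code such as `AMS` would be treated as a time; the colon requirement filters both out while still accepting the time strings.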
