[SPARK-53356][PYTHON][DOCS] Small improvements to python data source docs
### What changes were proposed in this pull request?
A few small improvements to the Python data source docs:
- Added more type annotations
- Added a little bit of hierarchy to make the relationships between sections more clear: nested several sections underneath a "Comprehensive Example: Data Source with Batch and Streaming Readers and Writers" section.
- A few small wording tweaks.
### Why are the changes needed?
I was reading the Python data source docs to learn how to use Python data sources and got confused by a few small things.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
### Was this patch authored or co-authored using generative AI tooling?
Closes #52101 from sryza/python-data-source-docs.
Lead-authored-by: Sandy Ryza <sandyryza@gmail.com>
Co-authored-by: Sandy Ryza <sandy.ryza@databricks.com>
Signed-off-by: Sandy Ryza <sandy.ryza@databricks.com>
File changed: python/docs/source/tutorial/sql/python_data_source.rst (85 additions, 51 deletions)
@@ -26,16 +26,18 @@ Overview
 The Python Data Source API is a new feature introduced in Spark 4.0, enabling developers to read from custom data sources and write to custom data sinks in Python.
 This guide provides a comprehensive overview of the API and instructions on how to create, use, and manage Python data sources.

-Simple Example
---------------
+Simple Example: Data Source with Batch Reader
+---------------------------------------------
 Here's a simple Python data source that generates exactly two rows of synthetic data.
 This example demonstrates how to set up a custom data source without using external libraries, focusing on the essentials needed to get it up and running quickly.

 **Step 1: Define the data source**

 .. code-block:: python

-    from pyspark.sql.datasource import DataSource, DataSourceReader
+    from typing import Iterator, Tuple
+
+    from pyspark.sql.datasource import DataSource, DataSourceReader, InputPartition
     from pyspark.sql.types import IntegerType, StringType, StructField, StructType

     class SimpleDataSource(DataSource):

@@ -44,21 +46,21 @@ This example demonstrates how to set up a custom data source without using exter
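For orientation, the batch-only data source this section builds boils down to a `DataSource` that declares a name and schema plus a `DataSourceReader` whose `read` method yields tuples. A minimal sketch of how those pieces fit together (the reader class name and the two hard-coded rows are illustrative, not copied from the patch):

```python
from pyspark.sql.datasource import DataSource, DataSourceReader
from pyspark.sql.types import IntegerType, StringType, StructField, StructType


class SimpleDataSourceReader(DataSourceReader):
    def read(self, partition):
        # Each yielded tuple becomes one row matching the schema below.
        yield ("Alice", 20)
        yield ("Bob", 30)


class SimpleDataSource(DataSource):
    """Generates exactly two rows of synthetic data."""

    @classmethod
    def name(cls):
        return "simple"

    def schema(self):
        return StructType([
            StructField("name", StringType()),
            StructField("age", IntegerType()),
        ])

    def reader(self, schema: StructType):
        return SimpleDataSourceReader()
```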
@@ -86,17 +88,18 @@ This example demonstrates how to set up a custom data source without using exter
     # +-----+---+


-Creating a Python Data Source
------------------------------
-To create a custom Python data source, you'll need to subclass the :class:`DataSource` base classes and implement the necessary methods for reading and writing data.
+Comprehensive Example: Data Source with Batch and Streaming Readers and Writers
+--------------------------------------------------------------------------------
+To create a custom Python data source, you'll need to subclass the :class:`DataSource` base class and implement the necessary methods for reading and writing data.

 This example demonstrates creating a simple data source to generate synthetic data using the `faker` library. Ensure the `faker` library is installed and accessible in your Python environment.

-**Define the Data Source**
+Define the Data Source
+~~~~~~~~~~~~~~~~~~~~~~

-Start by creating a new subclass of :class:`DataSource` with the source name, schema.
+Start by creating a new subclass of :class:`DataSource` with the source name and schema.

-In order to be used as source or sink in batch or streaming query, corresponding method of DataSource needs to be implemented.
+In order to be used as source or sink in batch or streaming query, corresponding methods of DataSource needs to be implemented.

 Method that needs to be implemented for a capability:
@@ -112,7 +115,15 @@ Method that needs to be implemented for a capability:

 .. code-block:: python

-    from pyspark.sql.datasource import DataSource, DataSourceReader
+    from typing import Union
+
+    from pyspark.sql.datasource import (
+        DataSource,
+        DataSourceReader,
+        DataSourceStreamReader,
+        DataSourceStreamWriter,
+        DataSourceWriter
+    )
     from pyspark.sql.types import StructType

     class FakeDataSource(DataSource):
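To make the capability-to-method mapping concrete: each optional `DataSource` method below unlocks one read/write path. This is a sketch of the shape of the class, not the patch's code; the `...` bodies stand in for the reader/writer classes defined in the following sections.

```python
from typing import Union

from pyspark.sql.datasource import (
    DataSource,
    DataSourceReader,
    DataSourceStreamReader,
    DataSourceStreamWriter,
    DataSourceWriter,
)
from pyspark.sql.types import StructType


class FakeDataSource(DataSource):
    @classmethod
    def name(cls) -> str:
        return "fake"

    def schema(self) -> Union[StructType, str]:
        return "name string, date string, zipcode string, state string"

    # Implement only the methods for the capabilities the source should support:
    def reader(self, schema: StructType) -> DataSourceReader:
        ...  # batch read:  spark.read.format("fake")

    def writer(self, schema: StructType, overwrite: bool) -> DataSourceWriter:
        ...  # batch write: df.write.format("fake")

    def streamReader(self, schema: StructType) -> DataSourceStreamReader:
        ...  # streaming read:  spark.readStream.format("fake")

    def streamWriter(self, schema: StructType, overwrite: bool) -> DataSourceStreamWriter:
        ...  # streaming write: df.writeStream.format("fake")
```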
@@ -123,39 +134,36 @@ Method that needs to be implemented for a capability:
         """

         @classmethod
-        def name(cls):
+        def name(cls) -> str:
             return "fake"

-        def schema(self):
+        def schema(self) -> Union[StructType, str]:
             return "name string, date string, zipcode string, state string"
@@ -171,10 +179,10 @@ Define the reader logic to generate synthetic data. Use the `faker` library to p
                 row.append(value)
             yield tuple(row)

-**Implement the Writer**
+Implement a Batch Writer
+~~~~~~~~~~~~~~~~~~~~~~~~

-Create a fake data source writer that processes each partition of data, counts the rows, and either
-prints the total count of rows after a successful write or the number of failed tasks if the writing process fails.
+Create a fake data source writer that processes each partition of data, counts the rows, and either prints the total count of rows after a successful write or the number of failed tasks if the writing process fails.

 .. code-block:: python
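For reference, a batch writer matching this description typically returns one commit message per partition from `write`, then aggregates them in `commit` (or `abort` on failure). A sketch along those lines (the commit-message dataclass name is illustrative):

```python
from dataclasses import dataclass
from typing import Iterator, List

from pyspark.sql import Row
from pyspark.sql.datasource import DataSourceWriter, WriterCommitMessage


@dataclass
class SimpleCommitMessage(WriterCommitMessage):
    partition_id: int
    count: int


class FakeDataSourceWriter(DataSourceWriter):
    def write(self, rows: Iterator[Row]) -> SimpleCommitMessage:
        # Runs on executors: count the rows in this partition.
        from pyspark import TaskContext

        partition_id = TaskContext.get().partitionId()
        cnt = sum(1 for _ in rows)
        return SimpleCommitMessage(partition_id=partition_id, count=cnt)

    def commit(self, messages: List[SimpleCommitMessage]) -> None:
        # Runs on the driver after all write tasks succeed.
        total_count = sum(m.count for m in messages)
        print(f"Total number of rows: {total_count}")

    def abort(self, messages: List[SimpleCommitMessage]) -> None:
        # Runs on the driver if the write fails; messages from failed tasks are None.
        failed_count = sum(m is None for m in messages)
        print(f"Number of failed tasks: {failed_count}")
```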
@@ -208,16 +216,15 @@ prints the total count of rows after a successful write or the number of failed
             print(f"Number of failed tasks: {failed_count}")


-Implementing Streaming Reader and Writer for Python Data Source
[...]
 This is a dummy streaming data reader that generate 2 rows in every microbatch. The streamReader instance has a integer offset that increase by 2 in every microbatch.

 .. code-block:: python

     class RangePartition(InputPartition):
-        def __init__(self, start, end):
+        def __init__(self, start: int, end: int):
             self.start = start
             self.end = end
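Filling in the rest of the dummy streaming reader described above: it tracks an integer offset that grows by 2 per microbatch, exposes one `RangePartition` per batch, and emits `(i, str(i))` rows. A sketch of how those pieces are usually wired together:

```python
from pyspark.sql.datasource import DataSourceStreamReader, InputPartition


class RangePartition(InputPartition):
    def __init__(self, start: int, end: int):
        self.start = start
        self.end = end


class FakeStreamReader(DataSourceStreamReader):
    def __init__(self, schema, options):
        self.current = 0

    def initialOffset(self) -> dict:
        # Start the stream at offset 0.
        return {"offset": 0}

    def latestOffset(self) -> dict:
        # Advance by 2 every time a new microbatch is planned.
        self.current += 2
        return {"offset": self.current}

    def partitions(self, start: dict, end: dict):
        # A single partition covering this microbatch's offset range.
        return [RangePartition(start["offset"], end["offset"])]

    def commit(self, end: dict):
        # Invoked once data before `end` has been processed; clean up resources here.
        pass

    def read(self, partition: RangePartition):
        # Emit two rows per microbatch.
        for i in range(partition.start, partition.end):
            yield (i, str(i))
```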
@@ -238,14 +245,14 @@ This is a dummy streaming data reader that generate 2 rows in every microbatch.
[...]
         This is invoked when the query has finished processing data before end offset, this can be used to clean up resource.
         """
@@ -259,24 +266,44 @@ This is a dummy streaming data reader that generate 2 rows in every microbatch.
         for i in range(start, end):
             yield (i, str(i))

-**Implement the Simple Stream Reader**
+Alternative: Implement a Simple Streaming Reader
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 If the data source has low throughput and doesn't require partitioning, you can implement SimpleDataSourceStreamReader instead of DataSourceStreamReader.

-One of simpleStreamReader() and streamReader() must be implemented for readable streaming data source. And simpleStreamReader() will only be invoked when streamReader() is not implemented.
+One of simpleStreamReader() and streamReader() must be implemented for a readable streaming data source. And simpleStreamReader() will only be invoked when streamReader() is not implemented.
+
+.. code-block:: python
+
+    from pyspark.sql.datasource import SimpleDataSourceStreamReader
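Continuing that thought, a `SimpleDataSourceStreamReader` skips partition planning: `read(start)` returns the new rows together with the next offset, and `readBetweenOffsets(start, end)` deterministically replays a range on retry. A sketch under that assumption, mirroring the two-rows-per-batch reader above:

```python
from pyspark.sql.datasource import SimpleDataSourceStreamReader


class SimpleStreamReader(SimpleDataSourceStreamReader):
    def initialOffset(self) -> dict:
        return {"offset": 0}

    def read(self, start: dict):
        # Return (rows, next offset): two rows per call.
        begin = start["offset"]
        rows = [(i, str(i)) for i in range(begin, begin + 2)]
        return iter(rows), {"offset": begin + 2}

    def readBetweenOffsets(self, start: dict, end: dict):
        # Replay a previously reported range deterministically when a batch is retried.
        return iter([(i, str(i)) for i in range(start["offset"], end["offset"])])
```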
[...]
 User defined DataSource, DataSourceReader, DataSourceWriter, DataSourceStreamReader and DataSourceStreamWriter and their methods must be able to be serialized by pickle.

-For library that are used inside a method, it must be imported inside the method. For example, TaskContext must be imported inside the read() method in the code below.
+For libraries that are used inside a method, they must be imported inside the method. For example, TaskContext must be imported inside the read() method in the code below.

 .. code-block:: python
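The doc's own example code is truncated in this view; as an illustration of the pickling guidance, here is a small reader sketch where `TaskContext` is imported inside `read()` rather than at module level (the class name and yielded row are illustrative):

```python
from pyspark.sql.datasource import DataSourceReader


class PartitionIdReader(DataSourceReader):
    def read(self, partition):
        # Import inside the method that runs on executors, so the reader object
        # itself never carries a module-level reference that could break pickling.
        from pyspark import TaskContext

        partition_id = TaskContext.get().partitionId()
        yield (partition_id, str(partition_id))
```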
@@ -358,7 +392,7 @@ For library that are used inside a method, it must be imported inside the method

 Using a Python Data Source
 --------------------------
-**Use a Python Data Source in Batch Query**
+**Register a Python Data Source**

 After defining your data source, it must be registered before usage.
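Registration and use then look roughly like this (`FakeDataSource` is the class from the example above; the format name passed to `format()` must match what `name()` returns):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Register the data source class; it becomes addressable by its short name.
spark.dataSource.register(FakeDataSource)

# Batch read with the registered name.
spark.read.format("fake").load().show()

# A streaming read works the same way once a stream reader is implemented:
# spark.readStream.format("fake").load()
```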