Commit 11160e9

Add section on composing and verbose regular expressions
1 parent 1501833 commit 11160e9

1 file changed

+214
-16
lines changed

source-code/regexes/regexes.ipynb

Lines changed: 214 additions & 16 deletions
@@ -4,14 +4,14 @@
44
"cell_type": "markdown",
55
"metadata": {},
66
"source": [
7-
"# Regular expressions in Python"
7+
"Regular expressions are very useful in many situations, and not exclusive to Python. In fact, once you grasp the concepts, you'll find them indispensable and use them (or miss them) for many programming and data management tasks. This notebook intends to give you a flavor of the possibilities; it doesn't intend to be a comprehensive overview."
88
]
99
},
1010
{
1111
"cell_type": "markdown",
1212
"metadata": {},
1313
"source": [
14-
"Regular expressions are very useful in many situations, and not exclusive to Python. In fact, once you grasp the concepts, you'll find them indispensable and use them (or miss them) for many programming and data management tasks. This notebook intends to give you a flavor of the possibilities; it doesn't intend to be a comprehensive overview."
14+
"# Requirements"
1515
]
1616
},
1717
{
@@ -23,9 +23,9 @@
2323
},
2424
{
2525
"cell_type": "code",
26-
"execution_count": null,
26+
"execution_count": 4,
2727
"metadata": {
28-
"collapsed": true
28+
"tags": []
2929
},
3030
"outputs": [],
3131
"source": [
@@ -36,7 +36,7 @@
3636
"cell_type": "markdown",
3737
"metadata": {},
3838
"source": [
39-
"## Match making"
39+
"# Match making"
4040
]
4141
},
4242
{
@@ -57,7 +57,10 @@
5757
"cell_type": "code",
5858
"execution_count": null,
5959
"metadata": {
60-
"collapsed": false
60+
"collapsed": false,
61+
"jupyter": {
62+
"outputs_hidden": false
63+
}
6164
},
6265
"outputs": [],
6366
"source": [
@@ -78,7 +81,10 @@
7881
"cell_type": "code",
7982
"execution_count": null,
8083
"metadata": {
81-
"collapsed": false
84+
"collapsed": false,
85+
"jupyter": {
86+
"outputs_hidden": false
87+
}
8288
},
8389
"outputs": [],
8490
"source": [
@@ -99,7 +105,10 @@
99105
"cell_type": "code",
100106
"execution_count": null,
101107
"metadata": {
102-
"collapsed": false
108+
"collapsed": false,
109+
"jupyter": {
110+
"outputs_hidden": false
111+
}
103112
},
104113
"outputs": [],
105114
"source": [
@@ -120,7 +129,10 @@
120129
"cell_type": "code",
121130
"execution_count": null,
122131
"metadata": {
123-
"collapsed": false
132+
"collapsed": false,
133+
"jupyter": {
134+
"outputs_hidden": false
135+
}
124136
},
125137
"outputs": [],
126138
"source": [
@@ -134,7 +146,7 @@
134146
"cell_type": "markdown",
135147
"metadata": {},
136148
"source": [
137-
"## Extracting stuff"
149+
"# Extracting stuff"
138150
]
139151
},
140152
{
@@ -155,7 +167,10 @@
155167
"cell_type": "code",
156168
"execution_count": null,
157169
"metadata": {
158-
"collapsed": false
170+
"collapsed": false,
171+
"jupyter": {
172+
"outputs_hidden": false
173+
}
159174
},
160175
"outputs": [],
161176
"source": [
@@ -177,7 +192,7 @@
177192
"cell_type": "markdown",
178193
"metadata": {},
179194
"source": [
180-
"## Substitution"
195+
"# Substitution"
181196
]
182197
},
183198
{
@@ -191,7 +206,10 @@
191206
"cell_type": "code",
192207
"execution_count": null,
193208
"metadata": {
194-
"collapsed": false
209+
"collapsed": false,
210+
"jupyter": {
211+
"outputs_hidden": false
212+
}
195213
},
196214
"outputs": [],
197215
"source": [
@@ -212,7 +230,10 @@
212230
"cell_type": "code",
213231
"execution_count": null,
214232
"metadata": {
215-
"collapsed": false
233+
"collapsed": false,
234+
"jupyter": {
235+
"outputs_hidden": false
236+
}
216237
},
217238
"outputs": [],
218239
"source": [
@@ -221,6 +242,183 @@
221242
" new_file_name = re.sub(r'(\\w+)_(\\d+)\\.', r'\\2_\\1.', file_name)\n",
222243
" print('{old:15s} -> {new}'.format(old=file_name, new=new_file_name))"
223244
]
245+
},
246+
{
247+
"cell_type": "markdown",
248+
"metadata": {},
249+
"source": [
250+
"# Composition"
251+
]
252+
},
253+
{
254+
"cell_type": "markdown",
255+
"metadata": {},
256+
"source": [
257+
"Sophisticated regular expressions tend to be very hard to read. There are a couple of things you can do to mitigate that issue.\n",
258+
"* Use `re.VERBOSE` so that you can add whitespace and comments to the regular expression definitions.\n",
259+
"* Use composition, i.e., define regular expressions that describe part of the match, and compose those to match the entire expression."
260+
]
261+
},
262+
{
263+
"cell_type": "markdown",
264+
"metadata": {},
265+
"source": [
266+
"Consider the following example, a log message. We want to extract the date-time information, the log level, the process number and the exit value."
267+
]
268+
},
269+
{
270+
"cell_type": "code",
271+
"execution_count": 26,
272+
"metadata": {},
273+
"outputs": [],
274+
"source": [
275+
"log_entry = '2021-08-25 17:04:23.439405 [info]: end process 1 exited with 2'"
276+
]
277+
},
278+
{
279+
"cell_type": "markdown",
280+
"metadata": {},
281+
"source": [
282+
"Rather than writing a regular expression that describes the entire log message, we write expressions that match part of it."
283+
]
284+
},
285+
{
286+
"cell_type": "code",
287+
"execution_count": 33,
288+
"metadata": {},
289+
"outputs": [],
290+
"source": [
291+
"date = r'\\d{4}-\\d{2}-\\d{2}'\n",
292+
"time = r'\\d{2}:\\d{2}:\\d{2}\\.\\d+'"
293+
]
294+
},
295+
{
296+
"cell_type": "markdown",
297+
"metadata": {},
298+
"source": [
299+
"Let's check that the time matches."
300+
]
301+
},
302+
{
303+
"cell_type": "code",
304+
"execution_count": 34,
305+
"metadata": {},
306+
"outputs": [
307+
{
308+
"data": {
309+
"text/plain": [
310+
"'17:04:23.439405'"
311+
]
312+
},
313+
"execution_count": 34,
314+
"metadata": {},
315+
"output_type": "execute_result"
316+
}
317+
],
318+
"source": [
319+
"match = re.search(time, log_entry)\n",
320+
"match.group(0)"
321+
]
322+
},
323+
{
324+
"cell_type": "markdown",
325+
"metadata": {},
326+
"source": [
327+
"We can now use `date` and `time` to match the entire date-time value."
328+
]
329+
},
330+
{
331+
"cell_type": "code",
332+
"execution_count": 45,
333+
"metadata": {},
334+
"outputs": [
335+
{
336+
"data": {
337+
"text/plain": [
338+
"'2021-08-25 17:04:23.439405'"
339+
]
340+
},
341+
"execution_count": 45,
342+
"metadata": {},
343+
"output_type": "execute_result"
344+
}
345+
],
346+
"source": [
347+
"regex = re.compile(r'({date}\\s+{time})'.format(date=date, time=time))\n",
348+
"match = regex.search(log_entry)\n",
349+
"match.group(1)"
350+
]
351+
},
352+
{
353+
"cell_type": "code",
354+
"execution_count": 46,
355+
"metadata": {},
356+
"outputs": [],
357+
"source": [
358+
"level = r'\\[(\\w+)\\]'\n",
359+
"msg = r'end\\s+process\\s+(\\d+)\\s+exited\\s+with\\s+(\\d+)'"
360+
]
361+
},
362+
{
363+
"cell_type": "code",
364+
"execution_count": 47,
365+
"metadata": {},
366+
"outputs": [
367+
{
368+
"name": "stdout",
369+
"output_type": "stream",
370+
"text": [
371+
"datetime = 2021-08-25 17:04:23.439405\n",
372+
"log level: info\n",
373+
"process = 1\n",
374+
"exit status = 2\n"
375+
]
376+
}
377+
],
378+
"source": [
379+
"regex = re.compile(r'({date}\\s+{time})\\s+{level}\\s*:\\s*{msg}'.format(date=date, time=time, level=level, msg=msg))\n",
380+
"match = regex.match(log_entry)\n",
381+
"print(f'datetime = {match.group(1)}')\n",
382+
"print(f'log level: {match.group(2)}')\n",
383+
"print(f'process = {match.group(3)}')\n",
384+
"print(f'exit status = {match.group(4)}')"
385+
]
386+
},
387+
{
388+
"cell_type": "markdown",
389+
"metadata": {},
390+
"source": [
391+
"Although the final regular expression is still rather long, it is easier to read and to maintain. Using `re.VERBOSE` and triple-quoted strings makes the regular expression even more maintainable."
392+
]
393+
},
394+
{
395+
"cell_type": "code",
396+
"execution_count": 48,
397+
"metadata": {},
398+
"outputs": [
399+
{
400+
"name": "stdout",
401+
"output_type": "stream",
402+
"text": [
403+
"datetime = 2021-08-25 17:04:23.439405\n",
404+
"log level: info\n",
405+
"process = 1\n",
406+
"exit status = 2\n"
407+
]
408+
}
409+
],
410+
"source": [
411+
"regex = re.compile(r'''\n",
412+
"    ({date}\\s+{time})\\s+       # date-time, up to microsecond precision\n",
413+
" {level}\\s*:\\s* # log level of the log message\n",
414+
" {msg} # actual log message\n",
415+
" '''.format(date=date, time=time, level=level, msg=msg), re.VERBOSE)\n",
416+
"match = regex.match(log_entry)\n",
417+
"print(f'datetime = {match.group(1)}')\n",
418+
"print(f'log level: {match.group(2)}')\n",
419+
"print(f'process = {match.group(3)}')\n",
420+
"print(f'exit status = {match.group(4)}')"
421+
]
224422
}
225423
],
226424
"metadata": {
@@ -239,9 +437,9 @@
239437
"name": "python",
240438
"nbconvert_exporter": "python",
241439
"pygments_lexer": "ipython3",
242-
"version": "3.5.1"
440+
"version": "3.7.7"
243441
}
244442
},
245443
"nbformat": 4,
246-
"nbformat_minor": 0
444+
"nbformat_minor": 4
247445
}
