There are loads of these all over the place, but here are some useful preg_replace examples for text and HTML processing that were hard to find or that I ended up writing myself. Use, praise, embellish or flame as you see fit.

Remove repeated words (case insensitive)

$text = preg_replace('/\b(\w+)(?:\s+\1\b)+/i', '$1', $text);

‘Keep your your head’ becomes ‘Keep your head’
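To show how a pattern like this fits together, here is a minimal annotated sketch using the x modifier, which lets the regex carry its own comments. The function name remove_repeated_words is made up for illustration, and the pattern is one possible way to do it, not the only one.

```php
<?php
// Hypothetical helper name, used only for this example.
function remove_repeated_words($text)
{
	// Single quotes keep \1 as a regex backreference; inside double
	// quotes PHP would read "\1" as an octal character escape.
	$pattern = '/
		\b(\w+)        # a whole word, captured as group 1
		(?:\s+\1\b)+   # one or more repeats of group 1, each after whitespace
	/ix';

	// Keep only the first occurrence of the word.
	return preg_replace($pattern, '$1', $text);
}

echo remove_repeated_words('Keep your your head'), "\n";     // Keep your head
echo remove_repeated_words('The the cat sat sat sat'), "\n"; // The cat sat
```

The \b word boundaries stop it from treating ‘the then’ as a repeat.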

Remove repeated punctuation

$text = preg_replace('/\.+/', '.', $text);

‘Keep your head...’ becomes ‘Keep your head.’ To do the same for other punctuation, don’t forget to escape characters that are special in a regex, such as ? and +.
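If you want this for several punctuation marks at once, something along these lines should work. It is only a sketch: the function name and the default list of characters are assumptions for the example, and preg_quote does the escaping so you don’t have to.

```php
<?php
// Hypothetical helper; $chars is an illustrative default, adjust to taste.
function collapse_repeated_punctuation($text, $chars = '.!?,')
{
	// preg_quote() escapes regex special characters; the second
	// argument also escapes the / delimiter.
	$class = preg_quote($chars, '/');

	// ([...]) captures one punctuation character and \1+ matches
	// further copies of that same character.
	return preg_replace('/([' . $class . '])\1+/', '$1', $text);
}

echo collapse_repeated_punctuation('Keep your head...'), "\n"; // Keep your head.
echo collapse_repeated_punctuation('What??!!'), "\n";          // What?!
```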

Clean up a sentence end that has no trailing space

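$text = preg_replace('/\.(?=\S)/', '. ', $text);

‘Keep your head.Don’t fall apart’ becomes ‘Keep your head. Don’t fall apart’. This uses a lookahead: the character after the full stop is checked but not consumed, so nothing is added at the end of the text or before a line break. Be aware that it will also split decimals such as 3.14.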
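A slightly more cautious variant is sketched below. It assumes a new sentence starts with an uppercase ASCII letter, so decimals and lowercase domain names are left alone; the function name is made up for the example. Abbreviations like ‘U.S.A’ would still get spaces inserted.

```php
<?php
// Hypothetical helper; assumes sentences begin with A-Z.
function space_after_sentence_end($text)
{
	// \.        a literal full stop
	// (?=[A-Z]) lookahead: the next character must be an uppercase
	//           letter, but it is not consumed by the match
	return preg_replace('/\.(?=[A-Z])/', '. ', $text);
}

echo space_after_sentence_end("Keep your head.Don't fall apart"), "\n";
echo space_after_sentence_end('Pi is 3.14.'), "\n";
```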

Remove carriage returns, line feeds and tabs

$text = str_replace(array("\r\n", "\r", "\n", "\t"), '', $text);

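An oldie but a goodie. Note that the characters are removed outright, so words on either side of a line break end up joined together.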
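If you would rather keep words apart, one option is to collapse each run of whitespace into a single space instead. This is a sketch with a made-up function name, not a drop-in replacement for the line above.

```php
<?php
// Hypothetical helper name, used only for this example.
function collapse_whitespace($text)
{
	// \s+ matches any run of spaces, tabs, carriage returns and
	// line feeds; each run becomes one space, then the ends are trimmed.
	return trim(preg_replace('/\s+/', ' ', $text));
}

echo collapse_whitespace("Keep\r\nyour\thead\n"), "\n"; // Keep your head
```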

Get all image urls from an html document

$images = array();
// Capture the value of every src="..." or src='...' attribute.
preg_match_all('/\bsrc\s*=\s*(["\'])(.*?)\1/i', $data, $media);
foreach ($media[2] as $url)
{
	// Look at the path only, so a query string does not hide the extension.
	$path = (string) parse_url($url, PHP_URL_PATH);
	$extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
	if (in_array($extension, array('jpg', 'jpeg', 'gif', 'png'), true))
	{
		$images[] = $url;
	}
}

Puts all the image URLs (jpg, jpeg, gif and png) in the $images array.
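For messier markup, an HTML parser is usually more forgiving than a regex. The sketch below assumes the DOM extension is enabled and that $html is a non-empty string; the function name is made up for the example, and it only looks at img elements.

```php
<?php
// Hypothetical helper; requires PHP's DOM extension.
function image_urls_from_html($html)
{
	$images = array();
	$doc = new DOMDocument();

	// Real-world markup is rarely valid, so collect parser warnings
	// internally instead of printing them.
	$previous = libxml_use_internal_errors(true);
	$doc->loadHTML($html);
	libxml_clear_errors();
	libxml_use_internal_errors($previous);

	foreach ($doc->getElementsByTagName('img') as $img)
	{
		$url = $img->getAttribute('src');
		// Check the extension of the path only, ignoring any query string.
		$path = (string) parse_url($url, PHP_URL_PATH);
		$extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
		if (in_array($extension, array('jpg', 'jpeg', 'gif', 'png'), true))
		{
			$images[] = $url;
		}
	}
	return $images;
}

$html = '<p><img src="a.jpg"><img src=\'/pics/b.PNG?size=2\'><img src="c.svg"></p>';
print_r(image_urls_from_html($html));
```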

Strip non printable characters

$text = preg_replace("/[^[:print:]]+/", "", $text);

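Does what it says on the tin. Without the u modifier it works byte by byte, so it also strips tabs, line breaks and any accented or other multibyte UTF-8 characters.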
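If the text is UTF-8 and you want to keep accented letters, a Unicode-aware version might look like this. It is a sketch under the assumption that the input is valid UTF-8 (preg_replace returns null otherwise), and the function name is made up for the example.

```php
<?php
// Hypothetical helper; assumes $text is valid UTF-8.
function strip_control_characters($text)
{
	// \p{C} matches Unicode "other" characters: control codes,
	// format characters, unassigned code points and so on.
	// Tabs and line breaks are control codes, so they go too.
	return preg_replace('/\p{C}+/u', '', $text);
}

// "Caf\xC3\xA9" is "Café" written as UTF-8 bytes.
echo strip_control_characters("Caf\xC3\xA9\x00 ok\x07"), "\n";
```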

Remove HTML tags

$text = preg_replace
	(
	array(
	// Remove invisible content
	'@<head[^>]*?>.*?</head>@siu',
	'@<style[^>]*?>.*?</style>@siu',
	'@<script[^>]*?.*?</script>@siu',
	'@<object[^>]*?.*?</object>@siu',
	'@<embed[^>]*?.*?</embed>@siu',
	'@<applet[^>]*?.*?</applet>@siu',
	'@<noframes[^>]*?.*?</noframes>@siu',
	'@<noscript[^>]*?.*?</noscript>@siu',
	'@<noembed[^>]*?.*?</noembed>@siu',
	// Add line breaks before & after blocks
	'@<((br)|(hr))@iu',
	'@</?((address)|(blockquote)|(center)|(del))@iu',
	'@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
	'@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
	'@</?((table)|(th)|(td)|(caption))@iu',
	'@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
	'@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
	'@</?((frameset)|(frame)|(iframe))@iu',),
	array(
	' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
	"\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
	"\n\$0", "\n\$0",),$text
	);
// Remove all remaining tags and comments and return.
$text = strip_tags( $text );

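OK, so strip_tags sort of does this, but it only removes the tags themselves and leaves the contents of script, style and similar elements behind as text. Because the patterns use the u modifier, the input must be valid UTF-8.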
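Since preg_replace returns null when something goes wrong, it is worth checking the result before passing it on. The sketch below uses a cut-down pattern list and a made-up function name purely to show the check; PREG_BAD_UTF8_ERROR is the error you get when the u modifier meets text that is not valid UTF-8.

```php
<?php
// Hypothetical helper with a shortened pattern list, for illustration only.
function strip_invisible_html($html)
{
	$result = preg_replace(
		array(
			'@<style[^>]*?>.*?</style>@siu',
			'@<script[^>]*?>.*?</script>@siu',
		),
		' ',
		$html
	);

	if ($result === null)
	{
		// preg_last_error() reports why the last preg call failed.
		if (preg_last_error() === PREG_BAD_UTF8_ERROR)
		{
			throw new InvalidArgumentException('Input is not valid UTF-8');
		}
		throw new RuntimeException('preg_replace failed, error code ' . preg_last_error());
	}

	// Remove all remaining tags and comments.
	return strip_tags($result);
}

echo strip_invisible_html('<p>Hi</p><script>alert(1)</script>'), "\n";
```

If the input comes from an arbitrary web page, converting it to UTF-8 first avoids the exception.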

8 Responses to “PHP preg_replace – some useful regular expressions”

  1. qwe Says:

    The preg_replace in “Remove HTML tags” dies with an error PREG_SPLIT_OFFSET_CAPTURE if you try to parse a long text (like a full webpage), which is weird because there is no preg_split; I guess preg_replace is using it internally.

    But anyway, I can’t use it with a big web page :(

  2. qwe Says:

    For prosperity,

    My error was due to the text not being UTF8. The preg_replace assumes the text is UTF8.

    Also, that code is copied from this source:
    http://nadeausoftware.com/articles/2007/09/php_tip_how_strip_html_tags_web_page

    You should cite your sources!

  3. qwe Says:

    I meant posterity, not prosperity :-/

  4. dagfooyo Says:

    Thanks, useful tools! What I’m really trying to do though is learn how to create my own regular expressions in PHP, so what would be really, really helpful to me is an explanation of how and why each of these works.

