PHP preg_replace – some useful regular expressions
April 22, 2009
There are loads of these all over the place, but here are some useful preg_replace examples for text and HTML processing that were hard to find or that I ended up writing – use/praise/embellish/flame as you see fit.
Remove repeated words (case insensitive)
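$text = preg_replace('/\b(\w+)\s+\1\b/i', '$1', $text);
‘Keep your your head’ becomes ‘Keep your head’. The pattern is single-quoted so that PHP passes \1 to the regex engine as a backreference; inside double quotes PHP would read it as an octal escape.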
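A minimal sketch of the same idea wrapped in a function, extended to catch a word repeated more than twice. The helper name remove_repeated_words is hypothetical and not part of the original snippet.

```php
<?php
// Hypothetical helper, introduced only for illustration.
function remove_repeated_words($text)
{
    // \b(\w+)        a whole word, captured as group 1
    // (?:\s+\1\b)+   one or more further copies of that same word
    // i              compare case-insensitively
    return preg_replace('/\b(\w+)(?:\s+\1\b)+/i', '$1', $text);
}

echo remove_repeated_words('Keep your your head'), "\n";   // Keep your head
echo remove_repeated_words('The the THE end'), "\n";       // The end
```

The word boundaries (\b) stop the pattern from firing on partial matches such as ‘the theory’.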
Remove repeated punctuation
$text = preg_replace('/\.{2,}/', '.', $text);
‘Keep your head...’ becomes ‘Keep your head.’ To collapse a different character, swap it into the pattern, and don’t forget to escape regex metacharacters.
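One way to handle several punctuation marks at once, and to take care of the escaping automatically, is to build the pattern with preg_quote. This is only a sketch: the helper name collapse_repeated and its default character list are assumptions made for the example.

```php
<?php
// Hypothetical helper: collapse runs of the same punctuation mark into one.
function collapse_repeated($text, $chars = '.!?,')
{
    // preg_quote() escapes regex metacharacters such as . and ? so they
    // are matched literally; '/' is passed because it is our delimiter.
    $class = preg_quote($chars, '/');
    // ([...]) captures one mark, \1+ matches further copies of that same mark.
    return preg_replace('/([' . $class . '])\1+/', '$1', $text);
}

echo collapse_repeated('Keep your head...'), "\n";   // Keep your head.
echo collapse_repeated('Really??!!'), "\n";          // Really?!
```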
Clean up a sentence end that has no trailing space
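$text = preg_replace('/\.(?!\s|$)/', '. ', $text);
‘Keep your head.Don’t fall apart’ becomes ‘Keep your head. Don’t fall apart’. This uses a negative lookahead: the full stop matches only when it is not followed by whitespace or the end of the text, and the lookahead itself consumes nothing.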
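To see what the lookahead is doing, it can help to compare a negative and a positive one on the same input. The sample sentence and the second pattern are illustrative assumptions, not part of the original snippet.

```php
<?php
$text = "Keep your head.Don't fall apart. Pi is 3.14.";

// Negative lookahead: a full stop NOT followed by whitespace or the end of the text.
echo preg_replace('/\.(?!\s|$)/', '. ', $text), "\n";
// Keep your head. Don't fall apart. Pi is 3. 14.

// Positive lookahead: a full stop that IS followed by a capital letter.
echo preg_replace('/\.(?=[A-Z])/', '. ', $text), "\n";
// Keep your head. Don't fall apart. Pi is 3.14.
```

The negative form is broader and also splits decimals, so the positive form may be the safer choice for text that contains numbers or URLs.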
Remove carriage returns, line feeds and tabs
$text = str_replace(array("\r\n", "\r", "\n", "\t"), '', $text);
An oldie but a goodie. No regex needed; note that the characters are deleted outright, not replaced with a space.
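If the line breaks were the only thing separating two words, deleting them joins the words together. A small sketch of the alternative, replacing each run with a single space (the sample string is made up for the example):

```php
<?php
$text = "Keep\r\nyour\thead";

// Deleting the characters outright glues the words together.
echo str_replace(array("\r\n", "\r", "\n", "\t"), '', $text), "\n";   // Keepyourhead

// Replacing each run of them with one space keeps the words apart.
echo preg_replace('/[\r\n\t]+/', ' ', $text), "\n";                   // Keep your head
```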
Get all image urls from an html document
$images = array();
// Capture the value of every quoted src attribute; group 2 holds the URL
preg_match_all('/\bsrc\s*=\s*(["\'])(.*?)\1/i', $data, $media);
foreach ($media[2] as $url)
{
    $info = pathinfo($url);
    // Compare the extension case-insensitively so .JPG and .jpg both match
    if (isset($info['extension']) &&
        in_array(strtolower($info['extension']), array('jpg', 'jpeg', 'gif', 'png')))
    {
        $images[] = $url;
    }
}
Puts all the image URLs found in $data into the $images array.
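Regular expressions can be tripped up by unquoted attributes or unusual markup. As an alternative technique (not the method used in this post), here is a sketch that uses PHP's DOM extension to read the src of each img element. The function name extract_image_urls is hypothetical, and the DOM extension is assumed to be installed.

```php
<?php
// Hypothetical helper using DOMDocument instead of regular expressions.
function extract_image_urls($html)
{
    $images = array();
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('img') as $img)
    {
        $url = $img->getAttribute('src');
        // Check the extension of the path only, ignoring any query string.
        $path = (string) parse_url($url, PHP_URL_PATH);
        $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
        if (in_array($ext, array('jpg', 'jpeg', 'gif', 'png'), true))
        {
            $images[] = $url;
        }
    }
    return $images;
}

$html = '<p><img src="a.JPG"><img src=\'b.png?v=2\'><img src="c.svg"></p>';
print_r(extract_image_urls($html));   // a.JPG and b.png?v=2, but not c.svg
```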
Strip non printable characters
$text = preg_replace("/[^[:print:]]+/", "", $text);
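Does what it says on the tin. Be aware that, without the u modifier and in the default locale, this also strips line breaks, tabs and every non-ASCII byte, so accented and other multibyte UTF-8 characters disappear too.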
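For UTF-8 text, a gentler variant is to remove only Unicode control characters and leave everything else alone. This is a sketch under the assumption that the input is valid UTF-8; the helper name strip_control_chars is hypothetical.

```php
<?php
// Hypothetical helper: strip control characters but keep UTF-8 text,
// line feeds and tabs. Carriage returns are control characters and are removed.
function strip_control_chars($text)
{
    // (?![\n\t])  leave line feeds and tabs untouched
    // \p{Cc}      any Unicode control character
    // u           treat pattern and subject as UTF-8
    return preg_replace('/(?![\n\t])\p{Cc}/u', '', $text);
}

// "Café" followed by a NUL byte, then a BEL character inside the sentence.
$text = "Caf\xC3\xA9\x00 au\x07 lait\n";
var_dump(strip_control_chars($text) === "Caf\xC3\xA9 au lait\n");   // bool(true)
```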
Remove HTML tags
$text = preg_replace(
    array(
        // Remove invisible content
        '@<head[^>]*?>.*?</head>@siu',
        '@<style[^>]*?>.*?</style>@siu',
        '@<script[^>]*?.*?</script>@siu',
        '@<object[^>]*?.*?</object>@siu',
        '@<embed[^>]*?.*?</embed>@siu',
        '@<applet[^>]*?.*?</applet>@siu',
        '@<noframes[^>]*?.*?</noframes>@siu',
        '@<noscript[^>]*?.*?</noscript>@siu',
        '@<noembed[^>]*?.*?</noembed>@siu',
        // Add line breaks before & after blocks
        '@<((br)|(hr))@iu',
        '@</?((address)|(blockquote)|(center)|(del))@iu',
        '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
        '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
        '@</?((table)|(th)|(td)|(caption))@iu',
        '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
        '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
        '@</?((frameset)|(frame)|(iframe))@iu',
    ),
    array(
        ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
        "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
    ),
    $text );
// Remove all remaining tags and comments.
$text = strip_tags( $text );
OK, so strip_tags sort of does this, but it removes only the tags themselves and leaves the contents of script, style and similar elements in the text.
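Every pattern above carries the u modifier, which makes PCRE treat both pattern and subject as UTF-8. If the input is not valid UTF-8, preg_replace returns null and the text is lost. A minimal sketch of a guard, using a hypothetical wrapper named replace_utf8:

```php
<?php
// Hypothetical wrapper: report invalid UTF-8 instead of silently losing the text.
function replace_utf8($pattern, $replacement, $text)
{
    $result = preg_replace($pattern, $replacement, $text);
    if ($result === null && preg_last_error() === PREG_BAD_UTF8_ERROR)
    {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $result;
}

// Valid UTF-8: the tag is removed and the text survives.
echo replace_utf8('@<b[^>]*?>@siu', '', "<b>caf\xC3\xA9"), "\n";   // café

try
{
    // \xE9 is "é" in ISO-8859-1 but an invalid byte sequence in UTF-8.
    replace_utf8('@<b[^>]*?>@siu', '', "<b>caf\xE9");
}
catch (InvalidArgumentException $e)
{
    echo $e->getMessage(), "\n";   // Input is not valid UTF-8
}
```

Converting the page to UTF-8 before stripping the tags avoids the problem; which conversion function to use depends on the extensions available and on knowing the source encoding.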
July 10, 2009 at 10:50 am
The preg_replace in “Remove HTML tags” dies with an error PREG_SPLIT_OFFSET_CAPTURE if you try to parse a long text (like a full webpage), which is weird because there is no preg_split; I guess preg_replace is using it internally.
But anyway, I can’t use it with a big web page
July 10, 2009 at 11:49 pm
For prosperity,
My error was due to the text not being UTF8. The preg_replace assumes the text is UTF8.
Also, that code is copied from this source:
http://nadeausoftware.com/articles/2007/09/php_tip_how_strip_html_tags_web_page
You should cite your sources!
July 10, 2009 at 11:50 pm
I meant posterity, not prosperity :-/
October 8, 2009 at 7:11 pm
Thanks, useful tools! What I’m really trying to do, though, is learn how to create my own regular expressions in PHP, so what would be really, really helpful to me is an explanation of how and why each of these works.