Tuesday, 3 September 2013

Removing all HTML tags from CLOB content PLSQL

Removing all HTML tags from CLOB content PLSQL

I am currently in the process of creating an interface to a ticketing
system used by our company. The ticketing system stores Emails to and from
customers to track the progress of support issues raised by customers.
My interface uses a web service to pull back these tickets, style them and
display them within our system. My main issue is that because the ticket
content comes from an Email, it contains lots of formatting and styling
which causes a conflict with and often breaks my interface. I've therefore
been looking for a way to strip all html and styling from the Email
content so that I can then work with the plain text and style it how I
want.
So far, I've a combination of replace, and regex_replace for example:
l_content_formatter :=
regexp_replace(l_content_formatter,'\s*(style|dir|height|width|class|align|color|face|size|title|href|src|v)\s*=\s*"[^"]*"\s*','');
To remove all attributes from a tag.
l_html_content := replace(l_html_content,'<strong>','');
To remove a strong tag.
However, using this method will take absolutely ages and is generally a
bit hit and miss with regards to removing all html tags and anything that
appears like a tag:
e.g.
<span_mce_t style="font-size:13px">this is some content.</span_mce_t>
Therefore, my main issue is I need to find a pretty robust method of
removing all html tags but retatining their content within a clob
variable.
My colleague suggested looking for occurance of the start of a tag such as
<p
Then calculating it's position and finding the next ending tag's position
and removing everything in between but I don't think this is a very good
method.
Any ideas?

No comments:

Post a Comment