[SOLVED] How to extract specific colour highlighted text from an MS Word document ?

gsen

Distinguished
Mar 25, 2015
219
0
18,680
so i have a word document , a pretty long one, where i have highlighted important text in yellow and purple colours. Now, to maintain a summary of that doc, i want to extract all the highlighted texts of ONLY PURPLE COLOUR to another document. is there any way i can do this ? tried googling, and the highlight text method selects all highlighted texts and the few VBA codes i found don't work.

can somebody help me with this please ?
 
Solution
Family member was doing far more with HTML etc. than Python and was not dealing with original data in "colors".

I have been able to tinker a bit.

The difficulty (for me, thus far) has been getting the .docx into a form where the color codes for "purple" are readily identifiable as a "string" indicating that the following text is purple and when that "purple ends".

I explored saving the .docx as XML or HTML but even the simplest test become quite lengthy and not being a webmaster person the results were quite cumbersome. .pdf and other formats equally cumbersome.

However the following link was interesting. It searches for an input text and then highlights the text.

Java...
Word: being .doc or .docx format?

I suggest taking a look at Python.

Not sure, without much more information, how complicated the process (code) would be.

You will first need to find the applicable color codes (ANSI) by parsing for purple text and then extract that color coded text via string manipulations.

Google "parse edit .doc .docx text change color python".

Then revise the search words and phrasing to narrow or otherwise focus the resulting links.

Here are three of links that I found:

https://www.datacamp.com/community/tutorials/reading-and-editing-pdfs-and-word-documents-from-python

https://automatetheboringstuff.com/chapter13/

https://www.quora.com/How-can-I-find-and-replace-text-in-a-Word-document-using-Python

Be sure to "practice" on copies of your documents - well away from the originals that are also safely backed up elsewhere.

Start simple, then build.
 
  • Like
Reactions: gsen
FYI:

https://www.datanumen.com/blogs/2-quick-ways-extract-texts-color-word-document/

Appears to be a couple of rather straightforward methods.

This is one of those VBAs that never worked 🙁

also in the first method, it extracts ALL highlighted text irrespective of colour and extracts something like this :

phrase1
phrase2
sentence1
phrase3


and when go to find and replace again to replace "^p^p" with spaces, i still cant remove the line breaks and make them coherent . if you know a way around this, id appreciate it
 
Word: being .doc or .docx format?

I suggest taking a look at Python.

Not sure, without much more information, how complicated the process (code) would be.

You will first need to find the applicable color codes (ANSI) by parsing for purple text and then extract that color coded text via string manipulations.

Google "parse edit .doc .docx text change color python".

Then revise the search words and phrasing to narrow or otherwise focus the resulting links.

Here are three of links that I found:

https://www.datacamp.com/community/tutorials/reading-and-editing-pdfs-and-word-documents-from-python

https://automatetheboringstuff.com/chapter13/

https://www.quora.com/How-can-I-find-and-replace-text-in-a-Word-document-using-Python

Be sure to "practice" on copies of your documents - well away from the originals that are also safely backed up elsewhere.

Start simple, then build.
thanks but i was looking at something simple. this looks way more complicated for just an auxiliary task :/
 
Python may prove simpler.

I have a family member who has recently done a lot of work with Python.

Just emailed him a copy of your OP (Original Post).

In the meantime I have been working on being able to identify the control characters in Word that determine font color (yellow or purple).

For example:

https://wordribbon.tips.net/T000261_Discovering_the_RGB_Value_of_a_Custom_Text_Color.html

Unfortunately, Reveal Formatting (RGB etc.) is not functioning on my system - newly installed with Microsoft365. (Still need to figure out the reconfigurations needed for how I work in Word....)

The idea being that instead of changing the RGB code, the text is instead copied and pasted to the desired file.

Note the macro at the end of the article.

Could be Python just as well. At least I believe so.
 
  • Like
Reactions: gsen
Python may prove simpler.

I have a family member who has recently done a lot of work with Python.

Just emailed him a copy of your OP (Original Post).

In the meantime I have been working on being able to identify the control characters in Word that determine font color (yellow or purple).

For example:

https://wordribbon.tips.net/T000261_Discovering_the_RGB_Value_of_a_Custom_Text_Color.html

Unfortunately, Reveal Formatting (RGB etc.) is not functioning on my system - newly installed with Microsoft365. (Still need to figure out the reconfigurations needed for how I work in Word....)

The idea being that instead of changing the RGB code, the text is instead copied and pasted to the desired file.

Note the macro at the end of the article.

Could be Python just as well. At least I believe so.

you are way too kind. thank you. if you hear back from your family member, please let me know. ill try and see, in the meantime, if i can figure it out.

Really appreciate your help !
 
Family member was doing far more with HTML etc. than Python and was not dealing with original data in "colors".

I have been able to tinker a bit.

The difficulty (for me, thus far) has been getting the .docx into a form where the color codes for "purple" are readily identifiable as a "string" indicating that the following text is purple and when that "purple ends".

I explored saving the .docx as XML or HTML but even the simplest test become quite lengthy and not being a webmaster person the results were quite cumbersome. .pdf and other formats equally cumbersome.

However the following link was interesting. It searches for an input text and then highlights the text.

Java:

https://dev.to/jazzzzz/java-find-and-highlight-text-in-word-document-59eb

It seems to me (but I am not a Java person) that you could use (modfiy) the // Set highlight color code to be the "find" and then, once found, export that text to another file.

And I would not quite give up on Visual Basic:

https://www.extendoffice.com/documents/word/5450-word-find-duplicate-sentences.html

Instead of paragraphs you are dealing with words.

Same process: if a "purple" word (wdPurple) is found that word/string is captured and copied elsewhere.
 
Solution