URL: https://www.opennet.me/cgi-bin/openforum/vsluhboard.cgi
Forum: vsluhforumID9
Thread number: 703

Original message
"ms word (*.doc) format"

Posted by uin, 18-Apr-02 15:16
Please tell me how I can read this format under Unix. I only want the text, without pictures (or any other Windows controls).
That is, how can I extract only the text from a .doc file? In the end I want just a list of the words in the text (a search engine requires this), so the structure of the text does not matter.

Examples are preferred.

Thank you.



Messages in this thread
"RE: ms word (*.doc) format"
Posted by Soldier, 19-Apr-02 07:38
>Please tell me how I can read
>this format under Unix. I only
>want the text, without pictures
>(or any other Windows controls).
>That is, how can I extract only
>the text from a .doc file? In
>the end I want just a list of
>the words in the text (a search
>engine requires this), so the
>structure of the text does not
>matter.
>
>Examples are preferred.
>
>Thank you.

I hope this will help:
http://sourceforge.net/project/showfiles.php?group_id=10501&...


"Hm"
Posted by uin, 19-Apr-02 13:33
It looks very strange. Do you understand how it works?

Thank you.


"RE: Hm"
Posted by Soldier, 19-Apr-02 15:17
>It looks very strange. Do you
>understand how it works?
>
>
>Thank you.


For example, to extract only the text from the file file.doc:

wvWare -x /usr/local/share/wv/wvText.xml file.doc > somefile.txt

Without the -x option it produces output in HTML format by default.

P.S. In my opinion it looks a bit ugly, but I do not know of any other software for processing and converting MS Word documents. Maybe somebody else knows...


"clear, but"
Posted by uin, 19-Apr-02 16:34
I do not want to convert it that way. I want to use its API, if there is one. Could you show me such an example?

Thank you.


"RE: clear, but"
Posted by Soldier, 19-Apr-02 22:21
>I do not want to convert it
>that way. I want to use its
>API, if there is one. Could
>you show me such an example?
>
>
>Thank you.

Sorry, but I used it only once, to extract some information from a large number of Word documents; Perl, awk, and the C function 'popen' were enough for that. So use 'popen' for now, and either look for other software or study this source code for the future.

Best.
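
A minimal sketch of the 'popen' approach mentioned above, in C. It assumes wvWare is installed and on the PATH. The helper names (split_words, words_from_command, print_word) are made up for this illustration and are not part of wv. Only ASCII letters and digits count as word characters here, so text in other encodings would need extra handling.

```c
#define _POSIX_C_SOURCE 200809L /* expose popen/pclose in <stdio.h> */
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Hypothetical callback type: called once for every word found. */
typedef void (*word_fn)(const char *word, void *ctx);

/* Split a text stream into lowercase ASCII alphanumeric words.
 * Words longer than the buffer are truncated. Returns the word count. */
long split_words(FILE *in, word_fn fn, void *ctx)
{
    char word[128];
    size_t len = 0;
    long count = 0;
    int c;

    assert(in != NULL);
    while ((c = fgetc(in)) != EOF) {
        if (isalnum((unsigned char)c)) {
            if (len < sizeof word - 1)
                word[len++] = (char)tolower((unsigned char)c);
        } else if (len > 0) {
            /* A separator ends the current word. */
            word[len] = '\0';
            if (fn)
                fn(word, ctx);
            count++;
            len = 0;
        }
    }
    if (len > 0) {
        /* Last word, if the stream did not end with a separator. */
        word[len] = '\0';
        if (fn)
            fn(word, ctx);
        count++;
    }
    return count;
}

/* Run a converter command through the shell and split its standard
 * output into words. Returns the word count, or -1 on failure. */
long words_from_command(const char *cmd, word_fn fn, void *ctx)
{
    FILE *pipe;
    long n;

    assert(cmd != NULL);
    pipe = popen(cmd, "r");
    if (pipe == NULL)
        return -1;
    n = split_words(pipe, fn, ctx);
    if (pclose(pipe) == -1)
        return -1;
    return n;
}

/* Example callback: print each word on its own line. */
void print_word(const char *word, void *ctx)
{
    (void)ctx;
    puts(word);
}
```

With the command from the earlier post, the call would look like words_from_command("wvWare -x /usr/local/share/wv/wvText.xml file.doc", print_word, NULL). The command string goes through the shell, so file names containing spaces or quotes must be escaped first.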



"RE: ms word (*.doc) format"
Posted by Арлекин, 22-Apr-02 07:57
man -s1 strings

This utility outputs the text strings from any file (binary files too). Try it. Getting the source code for Linux or BSD is no problem.
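
To illustrate what strings does, here is a rough sketch in C that pulls runs of printable ASCII out of any file. The function name printable_runs and the MAX_RUN limit are invented for this example; it is not the real strings source. One caveat: Word files may store text as 16-bit characters, which a byte-wise scan like this misses; GNU strings has the -e l option for 16-bit little-endian text.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_RUN 4096 /* illustrative limit: longer runs are split */

/* Scan a binary stream for runs of printable ASCII (plus tab) that are
 * at least min_len bytes long. Each run is written to out on its own
 * line; pass NULL as out to count only. Returns the number of runs. */
long printable_runs(FILE *in, size_t min_len, FILE *out)
{
    char buf[MAX_RUN];
    size_t len = 0;
    long runs = 0;
    int c;

    assert(in != NULL);
    for (;;) {
        c = fgetc(in);
        if (c != EOF && ((c >= 0x20 && c <= 0x7e) || c == '\t')) {
            buf[len++] = (char)c;
            if (len < sizeof buf)
                continue;
            /* Buffer full: fall through and emit it as its own run. */
        }
        if (len > 0 && len >= min_len) {
            runs++;
            if (out != NULL) {
                fwrite(buf, 1, len, out);
                fputc('\n', out);
            }
        }
        len = 0;
        if (c == EOF)
            break;
    }
    return runs;
}
```

Opening file.doc with fopen(path, "rb") and calling printable_runs(fp, 4, stdout) gives output roughly comparable to the default behaviour of strings, which also uses a minimum length of 4.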


"RE: ms word (*.doc) format"
Posted by Soldier, 22-Apr-02 15:35
>man -s1 strings
>
>This utility outputs the text
>strings from any file (binary
>files too). Try it. Getting the
>source code for Linux or BSD
>is no problem.

It depends on the task. In my case it was unacceptable. :-(