new TextExtractor()
TextExtractor is used to analyze a PDF page and extract words and logical
structure within a given region. The resulting list of lines and words can
be traversed element by element or accessed as a string buffer. The class
also includes utility methods to extract PDF text as HTML or XML.
Possible use case scenarios for TextExtractor include:
- Converting PDF pages to text or XML for content repurposing.
- Searching PDF pages for specific words or keywords.
- Indexing large PDF repositories for search or content
retrieval purposes (e.g. implementing a PDF search engine).
- Classifying or summarizing PDF documents based on their text content.
- Finding specific words for content editing purposes (such as splitting pages
based on keywords).
The main task of TextExtractor is to interpret PDF pages and offer a
simple-to-use API to:
- Normalize all text content to Unicode.
- Extract inferred logical structure (word by word, line by line,
or paragraph by paragraph).
- Extract positioning information for every line, word, or glyph.
- Extract style information (such as the font, font size, and
font styles) for every line, word, or glyph.
- Control the content analysis process. A number of options (such as
removal of text obscured by images) are available to let the user
direct the content recognition algorithms to meet their
requirements.
- Offer utility methods to convert PDF page content to text, XML, or HTML.
Note: TextExtractor analyzes only the textual content of the page.
This means that rasterized text (e.g. in scanned pages) or vectorized
text (where glyphs are converted to path outlines) will not be recognized
as text. It is still possible to extract this content
using the pdftron.PDF.ElementReader interface.
In some cases TextExtractor may extract text that does not appear to
be on the visible page (e.g. when text is obscured by an image or a
rectangle). In these situations it is possible to use processing flags
such as 'e_remove_hidden_text' and 'e_no_invisible_text' to remove
hidden text.
A sample use case (in C++):
    // ... Initialize PDFNet ...
    PDFDoc doc(filein);
    doc.InitSecurityHandler();
    Page page = *doc.PageBegin();
    TextExtractor txt;
    txt.Begin(page, 0, TextExtractor::e_remove_hidden_text);
    UString text;
    txt.GetAsText(text);
    // or traverse words one by one...
    TextExtractor::Line line = txt.GetFirstLine(), lend;
    TextExtractor::Word word, wend;
    for (; line!=lend; line=line.GetNextLine()) {
        for (word=line.GetFirstWord(); word!=wend; word=word.GetNextWord()) {
            text.Assign(word.GetString(), word.GetStringLen());
            cout << text << '\n';
        }
    }
A sample use case (in C#):
    // ... Initialize PDFNet ...
    PDFDoc doc = new PDFDoc(filein);
    doc.InitSecurityHandler();
    Page page = doc.PageBegin().Current();
    TextExtractor txt = new TextExtractor();
    txt.Begin(page, 0, TextExtractor.ProcessingFlags.e_remove_hidden_text);
    string text = txt.GetAsText();
    // or traverse words one by one...
    TextExtractor.Word word;
    for (TextExtractor.Line line = txt.GetFirstLine(); line.IsValid(); line=line.GetNextLine()) {
        for (word=line.GetFirstWord(); word.IsValid(); word=word.GetNextWord()) {
            Console.WriteLine(word.GetString());
        }
    }
For full sample code, please take a look at the TextExtract sample project.
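For the JavaScript API documented on this page, the same flow might look roughly like the sketch below. It assumes that `PDFNet` is the already initialized namespace object and that `page` is a `Core.PDFNet.Page` obtained elsewhere; the helper name `extractPageText` is hypothetical and not part of the library. Cleanup of the extractor is not shown here.

```javascript
// Minimal sketch: extract the text of one page as a single string.
// Assumptions: `PDFNet` is initialized by the caller and `page` is a valid
// page object. `extractPageText` is a hypothetical helper name.
async function extractPageText(PDFNet, page) {
  // create() resolves to a new TextExtractor.
  const txt = await PDFNet.TextExtractor.create();
  // clip_ptr and flags are optional, so only the page is passed here.
  await txt.begin(page);
  // getAsText() resolves to a string.
  return txt.getAsText();
}
```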
Extends
Members
-
<static> ProcessingFlags
-
Type:
- number
Properties:
Name                    Type    Description
e_no_ligature_exp       number
e_no_dup_remove         number
e_punct_break           number
e_remove_hidden_text    number
e_no_invisible_text     number
e_no_watermarks         number
e_extract_using_zorder  number
-
<static> XMLOutputFlags
-
Type:
- number
Properties:
Name                 Type    Description
e_words_as_elements  number
e_output_bbox        number
e_output_style_info  number
Methods
-
<static> create()
-
Creates a new TextExtractor object.
Returns:
A promise that resolves to an object of type: "PDFNet.TextExtractor"
- Type
- Promise.<Core.PDFNet.TextExtractor>
-
begin(page [, clip_ptr] [, flags])
-
Start reading the page.
Parameters:
Name      Type              Argument    Description
page      Core.PDFNet.Page              Page to read.
clip_ptr  Core.PDFNet.Rect  <optional>  An optional clipping rectangle. This parameter can be used to selectively read text from a given rectangle.
flags     number            <optional>  A list of ProcessingFlags used to control text extraction algorithm.
Returns:
- Type
- Promise.<void>
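Since `flags` is a single number, several processing flags are presumably combined into one bitmask. A minimal sketch, assuming the `ProcessingFlags` members are distinct bit values that can be joined with bitwise OR; `combineFlags` and `beginWithFlags` are hypothetical helper names:

```javascript
// Combine any number of flag values into one bitmask with bitwise OR.
// Assumption: each ProcessingFlags member is a distinct bit value.
function combineFlags(...flags) {
  return flags.reduce((mask, flag) => mask | flag, 0);
}

// Hypothetical helper: start reading `page` inside the rectangle `clip`,
// passing the combined flags as the third argument of begin().
async function beginWithFlags(txt, page, clip, flags) {
  await txt.begin(page, clip, combineFlags(...flags));
}
```

A call could then pass, for example, `[ProcessingFlags.e_remove_hidden_text, ProcessingFlags.e_no_invisible_text]` as `flags`, where `ProcessingFlags` is the static member listed above.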
-
destroy()
-
Destructor
- Inherited From:
Returns:
- Type
- Promise.<void>
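One way to make sure `destroy()` is always called is to wrap the work in `try`/`finally`. This is a sketch only, assuming the object is not also being cleaned up automatically (for example by `PDFNet.runWithCleanup`); `withTextExtractor` is a hypothetical helper name:

```javascript
// Create a TextExtractor, hand it to `work`, and destroy it afterwards,
// even if `work` throws. `withTextExtractor` is a hypothetical helper.
async function withTextExtractor(PDFNet, work) {
  const txt = await PDFNet.TextExtractor.create();
  try {
    return await work(txt);
  } finally {
    await txt.destroy();
  }
}
```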
-
getAsText( [dehyphen])
-
Gets all words in the current selection as a single string.
Parameters:
Name      Type     Argument    Description
dehyphen  boolean  <optional>  If true, finds and removes hyphens that split words across two lines. Hyphens are often used at the end of lines as an indicator that a word spans two lines. Hyphen detection enables removal of the hyphen character and merging of text runs to form a single word. This option has no effect on Tagged PDF files.
Returns:
A promise that resolves to an object of type: "string"
- Type
- Promise.<string>
-
getAsXML( [xml_output_flags])
-
Gets text content in the form of an XML string.
Parameters:
Name              Type    Argument    Description
xml_output_flags  number  <optional>  Flags controlling XML output. For more information, please see TextExtractor.XMLOutputFlags.
XML output will be encoded in UTF-8 and will have the following structure:
PDFNet SDK is ... PDFNet SDK is an amazingly comprehensive, high-quality PDF developer toolkit... levels. Using the PDFNet PDF library, ... ...
Returns:
A promise that resolves to an object of type: "string"
- Type
- Promise.<string>
-
getFirstLine()
-
Returns:
A promise that resolves to the first line of text on the selected page.
Note: To traverse the list of all text lines on the page, use line.getNextLine().
Note: To traverse the list of all words on a given line, use line.getFirstWord().
- Type
- Promise.<Core.PDFNet.TextExtractorLine>
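The traversal described in the notes above can be sketched as follows. It assumes `begin()` has already resolved on `txt`, and that the line and word objects expose `isValid()`, `getNextLine()`, `getFirstWord()`, `getNextWord()`, and `getString()` as promise-returning methods mirroring the C# sample; those classes are not documented in this section. `collectWordsByLine` is a hypothetical helper name.

```javascript
// Walk every line and every word, returning an array of word arrays.
// Assumption: line/word objects offer isValid(), getNextLine(),
// getFirstWord(), getNextWord() and getString(), each returning a promise.
async function collectWordsByLine(txt) {
  const lines = [];
  for (let line = await txt.getFirstLine();
       await line.isValid();
       line = await line.getNextLine()) {
    const words = [];
    for (let word = await line.getFirstWord();
         await word.isValid();
         word = await word.getNextWord()) {
      words.push(await word.getString());
    }
    lines.push(words);
  }
  return lines;
}
```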
-
getHighlights(char_ranges)
-
Get a Highlights object based on an array of character ranges.
Parameters:
Name         Type            Description
char_ranges  Array.<object>  An array of character ranges to be highlighted, such as [{ "index": 1, "length": 10 }, { "index": 100, "length": 20 }]
Returns:
A promise that resolves to an object of type: "Highlights", containing the selected characters.
- Type
- Promise.<Core.PDFNet.Highlights>
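An array in the `char_ranges` shape can be built from a plain keyword search. The sketch below is only an illustration: it assumes, without confirmation from this page, that `index` refers to a character offset in the string returned by `getAsText()`. `findCharRanges` is a hypothetical helper name.

```javascript
// Find every non-overlapping occurrence of `keyword` in `text` and return
// the matches as { index, length } objects.
// Assumption: offsets into this string are what getHighlights() expects.
function findCharRanges(text, keyword) {
  const ranges = [];
  if (keyword.length === 0) return ranges; // avoid an endless loop
  let from = 0;
  while (true) {
    const index = text.indexOf(keyword, from);
    if (index === -1) break;
    ranges.push({ index, length: keyword.length });
    from = index + keyword.length;
  }
  return ranges;
}
```

The result could then be passed on, for example as `txt.getHighlights(findCharRanges(text, 'PDFNet'))`.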
-
getNumLines()
-
Returns:
A promise that resolves to the number of lines of text on the selected page.
- Type
- Promise.<number>
-
getQuads(mtx, quads, quads_size)
-
[CURRENTLY BUGGED]
Parameters:
Name        Type                  Description
mtx         Core.PDFNet.Matrix2D  The quadrilateral representing a tight bounding box for this word (in unrotated page coordinates).
quads       number
quads_size  number
Returns:
- Type
- Promise.<void>
-
getRightToLeftLanguage()
-
Returns:
A promise that resolves to the directionality of text extractor.
- Type
- Promise.<boolean>
-
getTextUnderAnnot(annot)
-
Get all the characters that intersect an annotation.
Parameters:
Name   Type               Description
annot  Core.PDFNet.Annot  The annotation to intersect with.
Returns:
A promise that resolves to an object of type: "string"
- Type
- Promise.<string>
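When several annotations are of interest, the call can simply be repeated for each one. A minimal sketch, assuming `annots` is an array of annotation objects gathered elsewhere and that `begin()` has already resolved on `txt`; `textUnderAnnots` is a hypothetical helper name:

```javascript
// Collect the text under each annotation, in the same order as `annots`.
// The calls are awaited one at a time rather than issued in parallel.
async function textUnderAnnots(txt, annots) {
  const results = [];
  for (const annot of annots) {
    results.push(await txt.getTextUnderAnnot(annot));
  }
  return results;
}
```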
-
getWordCount()
-
Returns:
A promise that resolves to the number of words on the page.
- Type
- Promise.<number>
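Together with `getNumLines()`, this gives a quick summary of a page. A small sketch, assuming `begin()` has already resolved on `txt`; `pageTextStats` is a hypothetical helper name:

```javascript
// Return the line and word counts of the page currently being read.
async function pageTextStats(txt) {
  const lines = await txt.getNumLines();
  const words = await txt.getWordCount();
  return { lines, words };
}
```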
-
setOCGContext(ctx)
-
Sets the Optional Content Group (OCG) context that should be used when processing the document. This function can be used to change the current OCG context. Optional content (such as PDF layers) will be selectively processed based on the states of optional content groups in the given context.
Parameters:
Name  Type                    Description
ctx   Core.PDFNet.OCGContext  Optional Content Group (OCG) context, or NULL if TextExtractor should process all content on the page.
Returns:
- Type
- Promise.<void>
-
setRightToLeftLanguage(rtl)
-
Sets the directionality of the text extractor. Must be called before processing of a page starts.
Parameters:
Name  Type     Description
rtl   boolean  When true, reverses the directionality of the TextExtractor algorithm.
Returns:
- Type
- Promise.<void>
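Because the directionality has to be set before page processing starts, the call order matters. A minimal sketch of that ordering, assuming `page` is obtained elsewhere; `beginRightToLeft` is a hypothetical helper name:

```javascript
// Set right-to-left mode first, then start reading the page.
// Resolves to the directionality reported by the extractor afterwards.
async function beginRightToLeft(txt, page) {
  await txt.setRightToLeftLanguage(true); // must come before begin()
  await txt.begin(page);
  return txt.getRightToLeftLanguage();
}
```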
-
takeOwnership()
-
Take the ownership of this object, so that PDFNet.runWithCleanup won't destroy this object.
- Inherited From:
Returns:
- Type
- void