How to Extract Text from PDF Documents Based on Columns inside .NET Apps

This technical tip explains how to extract text from multi-column PDF documents inside .NET applications. A PDF file may contain text, images, annotations, attachments, graphs and other elements, and Aspose.Pdf for .NET can add and manipulate all of them. The API is particularly useful for adding text to and extracting text from PDF documents. When a PDF document has more than one column and the page contents must be extracted while honoring that layout, Aspose.Pdf for .NET offers two approaches.

The first approach is to reduce the font size of the contents inside the PDF document and then perform text extraction.

The second approach uses ScaleFactor. Several improvements have been introduced in TextAbsorber and in the internal text formatting mechanism, so when extracting text in 'Pure' mode you may now specify the ScaleFactor option. The scale factor adjusts the grid that the internal text formatting mechanism uses during text extraction. Specifying ScaleFactor values between 1 and 0.1 (including 0.1) has the same effect as reducing the font size.

The following code snippets show both approaches.
// First approach - Reducing font size
// The following code snippet shows the steps to reduce text size and then extract text from the PDF document.

// [C# Code Sample]
// Requires: using System.IO; using Aspose.Pdf; using Aspose.Pdf.Text;
string path = "D:\\Temp\\";
// InitLicense() is the author's own helper and is not shown here
InitLicense();
// open the source document
Document pdfDocument = new Document(path + "net_New-age NED's.pdf");
// collect every text fragment in the document
TextFragmentAbsorber tfa = new TextFragmentAbsorber();
pdfDocument.Pages.Accept(tfa);
TextFragmentCollection tfc = tfa.TextFragments;
foreach (TextFragment tf in tfc)
{
    // reduce the font size to 70% of its original value
    tf.TextState.FontSize = tf.TextState.FontSize * 0.7f;
}
// save the PDF with reduced font size to a temporary stream
Stream st = new MemoryStream();
pdfDocument.Save(st);
// reload the document from the stream
pdfDocument = new Document(st);
// extract text from the reloaded document
TextAbsorber textAbsorber = new TextAbsorber();
pdfDocument.Pages.Accept(textAbsorber);
String extractedText = textAbsorber.Text;
System.IO.File.WriteAllText(path + "Extracted.txt", extractedText);

' [VB.NET Code Sample]
' Requires: Imports System.IO, Imports Aspose.Pdf
Dim path As String = "D:\Temp\"
' instantiate Document object
Dim pdfDocument As Document = New Document(path + "net_New-age NED's.pdf")
' collect every text fragment in the document
Dim tfa As Aspose.Pdf.Text.TextFragmentAbsorber = New Aspose.Pdf.Text.TextFragmentAbsorber()
pdfDocument.Pages.Accept(tfa)
Dim tfc As Aspose.Pdf.Text.TextFragmentCollection = tfa.TextFragments
For Each tf As Aspose.Pdf.Text.TextFragment In tfc
    ' reduce the font size to 70% of its original value
    tf.TextState.FontSize = tf.TextState.FontSize * 0.7F
Next
' create temporary stream object
Dim st As Stream = New MemoryStream()
' save PDF file with reduced font size
pdfDocument.Save(st)
' instantiate Document object with stream instance
pdfDocument = New Document(st)
' extract text from the reloaded document
Dim textAbsorber As Aspose.Pdf.Text.TextAbsorber = New Aspose.Pdf.Text.TextAbsorber()
pdfDocument.Pages.Accept(textAbsorber)
Dim extractedText As String = textAbsorber.Text
System.IO.File.WriteAllText(path + "Extracted.txt", extractedText)

// Second approach - Using ScaleFactor
// inputFile and outFile are file paths supplied by the caller.

// [C# Code Sample]
// Requires: using Aspose.Pdf; using Aspose.Pdf.Text;
Document pdfDocument = new Document(inputFile);
TextAbsorber textAbsorber = new TextAbsorber();
textAbsorber.ExtractionOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure);
// Setting the scale factor to 0.5 is enough to split columns in the majority of documents.
// Setting it to zero lets the algorithm choose the scale factor automatically.
textAbsorber.ExtractionOptions.ScaleFactor = 0.5; /* or 0 */
pdfDocument.Pages.Accept(textAbsorber);
String extractedText = textAbsorber.Text;
System.IO.File.WriteAllText(outFile, extractedText);

' [VB.NET Code Sample]
' Requires: Imports Aspose.Pdf
Dim pdfDocument As Document = New Document(inputFile)
Dim textAbsorber As Aspose.Pdf.Text.TextAbsorber = New Aspose.Pdf.Text.TextAbsorber()
textAbsorber.ExtractionOptions = New Aspose.Pdf.Text.TextExtractionOptions(Aspose.Pdf.Text.TextExtractionOptions.TextFormattingMode.Pure)
' Setting the scale factor to 0.5 is enough to split columns in the majority of documents.
' Setting it to zero lets the algorithm choose the scale factor automatically.
textAbsorber.ExtractionOptions.ScaleFactor = 0.5 ' or 0
pdfDocument.Pages.Accept(textAbsorber)
Dim extractedText As String = textAbsorber.Text
System.IO.File.WriteAllText(outFile, extractedText)

Url: http://www.aspose.com/.net/pdf-component.aspx

Language: C# | User: Sheraz Khan | Created: Jan 27, 2016 | Tags: Extract PDf text based on columns Extract text from PDF manipulate annotations in PDF manipulate Attachments in PDF Text addition to PDF document Text extraction from PDF document .NET PDF component