Artintel System Lab (P) Ltd.
    Data Extraction

We extract data from websites (HTML or PDF files) into a pre-specified format using W4F technology and Java, and
produce the output in XML. We have been doing this as part of a production process and have completed
20 projects. These projects delivered what we call Shopkeeper Units, or SKUs.
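
As a rough illustration of extracting fields from an HTML page into a pre-specified XML format, here is a minimal Java sketch. It is not the W4F approach: plain regular expressions stand in for the extraction rules, and the class name `RuleExtractorSketch`, the field names, and the sample page are all assumptions made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: regex rules stand in for real extraction rules.
final class RuleExtractorSketch {

    /** Applies each rule (field name -> regex with one capture group) to the page. */
    static Map<String, String> extract(String html, Map<String, Pattern> rules) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            Matcher m = rule.getValue().matcher(html);
            if (m.find()) {
                // Keep the first match only and strip surrounding whitespace.
                fields.put(rule.getKey(), m.group(1).trim());
            }
        }
        return fields;
    }

    /** Escapes the characters that are not allowed in XML text content. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Writes the extracted fields as one XML entry (element names are assumed). */
    static String toXml(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<sku>");
        for (Map.Entry<String, String> f : fields.entrySet()) {
            xml.append('<').append(f.getKey()).append('>')
               .append(escape(f.getValue()))
               .append("</").append(f.getKey()).append('>');
        }
        return xml.append("</sku>").toString();
    }

    public static void main(String[] args) {
        // Made-up sample page and rules, for illustration only.
        String html = "<html><body><h1 class=\"name\">Desk Lamp</h1>"
                + "<span class=\"price\"> 19.99 </span></body></html>";
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("name", Pattern.compile("<h1 class=\"name\">(.*?)</h1>"));
        rules.put("price", Pattern.compile("<span class=\"price\">(.*?)</span>"));
        System.out.println(toXml(extract(html, rules)));
    }
}
```

Because each rule is tied to the exact markup of the page, a sketch like this stops matching as soon as the source page changes its layout, which is why extractors of this kind need ongoing maintenance.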

Extraction projects fall into three types: Type A, Type B, and Hand Built extractors.

  • "Type A" extractors look at the individual data source page, typically HTML, and use client-defined rules to extract
    the proper data. These extractors generate smaller volumes of data, handle complex data sources, and are
    more sensitive to changes in the data source.
  • "Type B" (programming-logic-driven) extractors read data from a table that defines all the possible
    permutations of data to be generated. The extractors then iterate through the permutations, creating an XML
    entry for each SKU. These extractors tend to generate large volumes of data.
  • "Hand Built" extractors are just that. Here the client provides the data source with the data to be extracted
    and a list of rules. We then set up a process in which the source documents are processed by hand and the
    extracted data is entered into a database for delivery to the client.
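
The Type B idea, iterating through a table of permutations and creating one XML entry per SKU, can be sketched in Java as follows. This is only a sketch under stated assumptions: the class name `TypeBExtractorSketch`, the attribute names, the sample values, and the XML element names are hypothetical and do not come from an actual project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a permutation-driven ("Type B") generator.
final class TypeBExtractorSketch {

    /** Expands a table of attribute -> allowed values into every combination. */
    static List<Map<String, String>> permutations(Map<String, List<String>> table) {
        List<Map<String, String>> result = new ArrayList<>();
        result.add(new LinkedHashMap<>()); // start from one empty combination
        for (Map.Entry<String, List<String>> column : table.entrySet()) {
            List<Map<String, String>> next = new ArrayList<>();
            for (Map<String, String> partial : result) {
                for (String value : column.getValue()) {
                    // Extend every partial combination with each value of this attribute.
                    Map<String, String> row = new LinkedHashMap<>(partial);
                    row.put(column.getKey(), value);
                    next.add(row);
                }
            }
            result = next;
        }
        return result;
    }

    /** Escapes the characters that are not allowed in XML text content. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Writes one <sku> entry per combination (element names are assumed). */
    static String toXml(List<Map<String, String>> skus) {
        StringBuilder xml = new StringBuilder("<skus>\n");
        for (Map<String, String> sku : skus) {
            xml.append("  <sku>");
            for (Map.Entry<String, String> field : sku.entrySet()) {
                xml.append('<').append(field.getKey()).append('>')
                   .append(escape(field.getValue()))
                   .append("</").append(field.getKey()).append('>');
            }
            xml.append("</sku>\n");
        }
        return xml.append("</skus>\n").toString();
    }

    public static void main(String[] args) {
        // Made-up permutation table: 2 models x 3 colors = 6 SKUs.
        Map<String, List<String>> table = new LinkedHashMap<>();
        table.put("model", Arrays.asList("T100", "T200"));
        table.put("color", Arrays.asList("red", "blue", "black"));
        System.out.print(toXml(permutations(table)));
    }
}
```

The number of entries is the product of the value counts in each column of the table, which is why extractors of this type tend to produce large volumes of data.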

We also extract data from websites for Exactone Inc. by writing configuration files in a proprietary language called IQL
and submitting them to the Engine hosted at the client site. The Engine, built on proprietary Java-based technology,
aggregates the data and generates an online catalogue for use by the end customers. Apart from writing
the config files, we also maintain them to fix broken or missing links and similar problems, and to ensure that the catalogue data
is available to end users 24 hours a day, 365 days a year.