Extract data from a page
In this part, we will see how to extract data with CSS selectors and regular expressions. To do so, we will explain what a CSS selector and a regular expression are and how they work.
Scraping data with CSS selectors
Each element of a web page can be targeted by one or more CSS selectors. You can use these selectors to extract content from the HTML elements of a page.
Selectors are part of a CSS rule set and select HTML elements based on their id, class, type, attributes or pseudo-classes.
Here are some basic CSS selectors:
| Selector | Example | Description |
|---|---|---|
| #id | #firstname | Selects the element with id="firstname" |
| .class | .intro | Selects all elements with class="intro" |
| element.class | p.intro | Selects all <p> elements with class="intro" |
| * | * | Selects all elements |
| element | p | Selects all <p> elements |
| element, element | div, p | Selects all <div> elements and all <p> elements |
You can go to this page for more CSS selectors.
Now we will learn how to use CSS selectors to extract data from an HTML page.
To find the CSS selector of an element on a web page, proceed as follows:
Open Chrome or Firefox, right-click the targeted element and click Inspect. Then right-click the highlighted line and go to Copy > CSS Selector to see what the browser has guessed.
You can then try it in the Test tab of Grimport Script Editor to validate it.
For the example, we will use this web page: Test Page
#id css selector
The id selector is used to select a single element, identified by its id, together with everything it contains.
To retrieve an element with an id, type a hash character (#) followed by the element id.
The text “I am a programming language” is extracted using the #myFunction selector.
.class css selector
Unlike an id, a class can be used to identify multiple elements.
The class selector extracts all elements with a specific class attribute. To select elements with a class, write a dot character (.) followed by the class name.
Here, the .introduction class is used to extract each element selected in the image above.
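To make the difference between the two selectors concrete, here is a minimal sketch in plain Python (not Grimport) of what #id and .class matching amounts to. The page fragment is hypothetical apart from the #myFunction id, its text and the introduction class name, and the helper names select_by_id and select_by_class are made up for this illustration. It assumes the fragment is well-formed enough to be parsed as XML, which real pages often are not.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment (not the real Test Page markup).
PAGE = """<body>
  <p id="myFunction">I am a programming language</p>
  <p class="introduction">Java is object oriented.</p>
  <div class="introduction note">Python is dynamically typed.</div>
</body>"""

root = ET.fromstring(PAGE)


def select_by_id(root, element_id):
    """Mimic '#id': return the first element whose id attribute matches."""
    for element in root.iter():
        if element.get("id") == element_id:
            return element
    return None


def select_by_class(root, class_name):
    """Mimic '.class': return every element that carries the class."""
    return [e for e in root.iter() if class_name in e.get("class", "").split()]


by_id = select_by_id(root, "myFunction")
by_class = select_by_class(root, "introduction")

print(by_id.text)                  # text of the single element with that id
print([e.text for e in by_class])  # text of every element with that class
```

The id lookup stops at the first hit because an id is meant to be unique, while the class lookup collects every element, including the one that carries a second class.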
Now that you have seen how the previous CSS selectors work, you can practice with the others.
CSS selectors with GRIMPORT
We will now see how we can implement it in a Grimport script.
In order to extract specific information with a CSS selector, we use the select() function:
select ( cssSelector , _code , _selection )
Parameters beginning with an underscore (_) are optional.
The first parameter is the CSS selector, which must be enclosed in quotes because it is a string. The other parameters are optional.
The second argument indicates the code in which the CSS selector should be searched. If this parameter is null, the function uses the whole source code of the page being crawled (if you are in a FORPAGE script).
The third argument indicates what you want to extract from the element. Pointing to an element of the source code is not enough; you must also say what to take from it:
- the HTML with the tag of the element (outerHTML)
- the HTML inside the tag (innerHTML)
- the text of the element without the tags (text)
- an attribute of the element, like the src property of an img tag (indicate the name of the attribute, e.g. "src")
If several elements correspond to this CSS selector, select() returns the content of the first matching element.
<ul id="listofLanguages">
  <li>Java </li>
  <li>Python </li>
  <li>Groovy </li>
  <li>C++ </li>
</ul>
firstLanguage = select("#listofLanguages li")
console(firstLanguage) // -> "Java"
allLanguages = select("#listofLanguages", null, "text")
console(allLanguages) /* -> "Java Python Groovy C++" */
There is a very useful function that cleans the data and extracts it at the same time: cleanSelect. It works like select, with a 4th argument which is the data cleaning mode.
The functions select, selectAll, regex and regexAll all have an equivalent that combines extraction with data cleaning. Each time, you just have to add clean in front (cleanSelect, cleanSelectAll, cleanRegex, cleanRegexAll).
If you want to extract all elements, you can use selectAll(), which finds all elements matching the selector.
Example (with cleanSelectAll!):
<ul id="listofLanguages">
  <li>Java </li>
  <li>Python </li>
  <li>Groovy </li>
  <li>C++ </li>
</ul>
language = cleanSelectAll("#listofLanguages li")
console(language) // -> ["Java", "Python", "Groovy", "C++"]
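The difference between "first match" and "all matches" can also be reproduced outside Grimport. The sketch below uses Python's standard ElementTree module as a stand-in: it locates the element with id listofLanguages, collects its li descendants, and trims the surrounding white space. It is an illustration only, not how select() or selectAll() are implemented, and it assumes the snippet parses as well-formed XML.

```python
import xml.etree.ElementTree as ET

# The list from the example above, as a well-formed fragment.
SNIPPET = """<ul id="listofLanguages">
  <li>Java </li>
  <li>Python </li>
  <li>Groovy </li>
  <li>C++ </li>
</ul>"""

root = ET.fromstring(SNIPPET)

# Rough equivalent of the CSS selector "#listofLanguages li":
# find the element with that id, then every <li> beneath it.
container = root if root.get("id") == "listofLanguages" else root.find(".//*[@id='listofLanguages']")
items = list(container.iter("li"))

# "All matches": one cleaned string per <li>.
all_languages = [li.text.strip() for li in items]

# "First match": only the first <li>, or None when nothing matches.
first_language = all_languages[0] if all_languages else None

print(first_language)
print(all_languages)
```

Stripping each text value is a small taste of what the clean variants do: the raw text of each li still contains the trailing space seen in the markup.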
Scraping data with regular expressions
Regular expressions have the advantage of adapting flexibly to any type of string.
They are a very powerful tool based on search patterns.
Here is a page that will allow you to know the basics of regular expressions: Regular expressions Basic topics
Before starting this tutorial, make sure you know the basics of regular expressions.
In this example we will use the following page: Test Page
Some elements of this page will be more difficult than others to extract, so we will use regular expressions.
You can also test your regular expressions on Grimport Script Editor in the Test tab.
Here we want to extract the title of this page: "Welcome to CSS Selector Test Page"
The part of the code that interests us is:
<h1>Welcome to CSS Selector Test Page</h1>
The regex to extract the title that is inside <h1> and </h1> is:
/(?si)<h1>\s*([^<>]*)<\/h1>/
It sounds scary, but it's actually very simple:
- First, the special flag declaration (?si) tells the regex that the . also matches the newline character (s), which is not the case by default, and that case differences are ignored (i).
- Then there is the literal <h1>, which means that what we want to extract starts with the string <h1>.
- \s* signals that there can be any amount of white space.
- Inside the parentheses is what we want to extract. This is called an extraction mask.
- [^<>]* means that we want any characters except < and >.
- Finally, <\/h1> indicates that we want to stop here. The backslash before /h1 is an escape sequence: the / character is used to delimit a regular expression, a bit like the " character which delimits a string.
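The pattern described above can be tried outside Grimport too. The following sketch runs the same expression with Python's standard re module on a hypothetical fragment. Python does not delimit regexes with /, so the closing tag is written </h1> without the backslash. The upper-case tag and the line break in the sample are there only to exercise the i flag and \s*.

```python
import re

# Same pattern as in the text; "/" needs no escaping in Python.
TITLE_PATTERN = re.compile(r"(?si)<h1>\s*([^<>]*)</h1>")

# Hypothetical fragment: upper-case tag plus white space before the title.
html = "<body>\n<H1>\n   Welcome to CSS Selector Test Page</H1>\n</body>"

match = TITLE_PATTERN.search(html)

# group(1) is the extraction mask, i.e. the first pair of parentheses.
title = match.group(1) if match else None
print(title)
```

Because \s* sits outside the parentheses, the leading white space is consumed by the pattern but left out of the extracted title.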
title = regex(/(?si)<h1>\s*([^<>]*)<\/h1>/)
console(title) // -> "Welcome to CSS Selector Test Page"
Sometimes it makes sense to be a little more specific than <h1> at the beginning of the regex when there are multiple <h1>.
For example, in the page code, there are multiple <h2>, however we only want to extract one: "Java Programming Language"
It will not be enough to write:
It will then be necessary to write:
Moreover, several <h2> subtitles on this page are wrapped in <strong>, so this code still corresponds to several subtitles. We will see in the part below how to avoid this problem.
Regular expressions with GRIMPORT
We will now see how this is done with Grimport.
Here we use the cleanRegex() function:
cleanRegex ( regex , _code , _postProcessing , _numberOfMask )
cleanRegex() returns the mask of the first match of the regex:
language = cleanRegex(/(?si)<h2><strong>([^<>]*)<\/strong><\/h2>/)
console(language) // -> Java Programming Language
You can also use cleanRegexAll(), which returns the mask for every match of the regex:
language = cleanRegexAll(/(?si)<h2><strong>([^<>]*)<\/strong><\/h2>/)
console(language) // -> [Java Programming Language, C++ Programming Language]
You can then use the list extraction functions on what cleanRegexAll returns, for example to take the second element of the array with the get() function (indexes start at 0):
language = get(cleanRegexAll(/(?si)<h2><strong>([^<>]*)<\/strong><\/h2>/), 1)
console(language) // -> C++ Programming Language
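For comparison, here is the same "first match, all matches, pick one by index" sequence with Python's re module. The markup is a hypothetical fragment built to resemble the situation described in the text (two subtitles in strong, one without); it is not the real Test Page source.

```python
import re

H2_STRONG = re.compile(r"(?si)<h2><strong>([^<>]*)</strong></h2>")

# Hypothetical fragment: only two of the three subtitles use <strong>.
html = """
<h2><strong>Java Programming Language</strong></h2>
<h2>Python Programming Language</h2>
<h2><strong>C++ Programming Language</strong></h2>
"""

# First match only.
first_match = H2_STRONG.search(html)
first_language = first_match.group(1) if first_match else None

# Every match: with a single mask, findall returns the mask of each match.
all_languages = H2_STRONG.findall(html)

# Second element of the list (indexes start at 0).
second_language = all_languages[1]

print(first_language)
print(all_languages)
print(second_language)
```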
You can also extract all h2 by changing your regular expression:
h2 = cleanRegexAll(/(?si)<h2><[^<>]*>([^<>]*)<\/h2>/)
console(h2) // -> [Java Programming Language, Python Programming Language, Groovy Programming Language, C++ Programming Language]
It is possible to use several masks in a regular expression. If you do so and the mask to extract is not the first one, indicate its number in the appropriate regex parameter. To know the number of the mask, count the opening brackets "(" from the left, up to the mask you want to extract.
h2 = regex(/(?si)<((h2)|(h1))><[^<>]*>([^<>]*)<\/((h2)|(h1))>/, null, 4)
console(h2) // -> "Welcome to CSS Selector Test Page"
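Counting opening brackets is easy to get wrong, so it can help to inspect the groups directly. The sketch below uses Python's re module with a simplified variant of the pattern (the inner-tag part is dropped so that it matches a plain heading); the fragment is hypothetical. Group numbers in Python follow the same left-to-right rule for opening brackets.

```python
import re

# Simplified variant of the multi-mask pattern: masks 1-3 describe the opening
# tag, mask 4 is the heading text, masks 5-7 describe the closing tag.
HEADING = re.compile(r"(?si)<((h2)|(h1))>([^<>]*)</((h2)|(h1))>")

html = "<h1>Welcome to CSS Selector Test Page</h1>"
match = HEADING.search(html)

opening_tag = match.group(1)   # outer bracket: whichever alternative matched
h2_branch = match.group(2)     # the (h2) alternative did not take part
h1_branch = match.group(3)     # the (h1) alternative matched
heading_text = match.group(4)  # fourth opening bracket from the left

print(HEADING.groups)  # total number of masks in the pattern
print(heading_text)
```

A mask that belongs to an alternative that did not match yields no value (None in Python), which is one more reason to pass the right mask number.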
One of the important tasks in any web application is proper sanitization and standardization of data. Any data stored in a database should be in a standardized format, especially data that comes from external sources.
Several functions clean the data in order to standardize it. Here are the main ones:
- standardizeText() corrects many problems such as encoding errors: it converts HTML codes like &eacute; into the "é" character and replaces typographic apostrophes like ‘ with the standard apostrophe '.
- stripTags() removes all HTML tags from a code.
- number() extracts the number from an HTML code.
- htmlToPrice() extracts a price from a potentially "dirty" code.
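To give an idea of what this kind of cleaning involves, here is a rough Python approximation of three of these steps. The function names strip_tags, standardize_text and extract_number are hypothetical stand-ins written for this sketch; they are not Grimport's implementations and they cover far fewer cases (for instance, only numbers with a dot as decimal separator).

```python
import html
import re


def strip_tags(code):
    """Remove anything that looks like an HTML tag."""
    return re.sub(r"<[^<>]*>", "", code)


def standardize_text(text):
    """Decode HTML entities, normalize apostrophes and collapse white space."""
    text = html.unescape(text)
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    return re.sub(r"\s+", " ", text).strip()


def extract_number(code):
    """Return the first number found outside the tags, or None."""
    found = re.search(r"-?\d+(?:\.\d+)?", strip_tags(code))
    return float(found.group()) if found else None


dirty = "<p>Caf&eacute;   <b>d\u2019Orsay</b></p>"
clean = standardize_text(strip_tags(dirty))
amount = extract_number('<span class="price">19.99 EUR</span>')

print(clean)
print(amount)
```

Tags are stripped before entities are decoded so that an encoded "&lt;" in the text is not mistaken for the start of a tag.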
In Grimport, we prefer functions starting with "clean", like cleanSelect() or cleanRegex(), over select() and regex(), because they take an additional argument, _postProcessing, which is a cleaning function.
By default, this argument corresponds to standardizeText + stripTags.
Here is how the _postProcessing argument is constructed:
- null or nothing = stripTags + standardizeText
- "description" or "d" = standardizeText
- "price" or "p" = htmlToPrice
- "number", "decimal", "float" or "n" = number with decimal
- "integer" or "i" = numeric integer
- "none" or "-" or "." or "0" = nothing
For example, if you want to extract a price, it is better to write:
myPrice = cleanSelect("#myPrice", null, null, "price")
rather than:
myPrice = htmlToPrice(select("#rrp-price"))
Now that we've seen how to extract data with CSS selectors and regular expressions and you've seen how to clean up the data, I'd like to show you a short video example that will allow you to extract information from an online product on this page: