esi-data-scrapping / esi-table-data
Commit da12a613, authored Oct 19, 2017 by Andrii Marynets
Get media and common info
Parent: e8b3ad9c
Showing 1 changed file with 9 additions and 5 deletions

exa/exa/spiders/cb.py (+9, -5) @ da12a613
 # -*- coding: utf-8 -*-
 import json
+from urllib.request import urlparse
 import scrapy
 from scrapy.utils.project import get_project_settings
 from scrapy_splash import SplashRequest
...
...
@@ -113,13 +114,16 @@ class CbSpider(BaseSpider):
         item = ExaItem()
         item['date'] = self.format_date(prop['activity_date'])
         item['title'] = prop['activity_properties']['title']
-        item['url'] = prop['activity_properties']['url']
+        item['url'] = prop['activity_properties']['url']['value']
+        publisher = prop['activity_properties']['publisher']
         item.update(self.get_common_items(response.meta['company']))
+        item['media_id'] = self._get_media((publisher, item['url']))
+        print(item)

-    def _get_media(self, elem):
-        media_name = elem.xpath("./td[contains(@class, 'article')]/span/text()").extract_first()
-        media_url = elem.xpath("./td/a/@data_publisher").extract_first(
-        )
+    def _get_media(self, site):
+        media_name, media_url = site
+        clean = lambda x: x[4:] if x.startswith('www.') else x
+        media_url = clean(urlparse(media_url).netloc)
         query = "select * from wp_esi_media where name like '%{}%' or url like '%{}%'".format(media_name, media_url)
         media = self.pipeline.db.select(query)
         if len(media) == 0:
...
...
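Review note: the new `_get_media` builds its SQL with `str.format`, so a publisher name containing a quote would break the query or open it to injection. The sketch below shows one possible way to keep the same lookup with bound parameters. It is not the project's code. `clean_host` and `find_media` are hypothetical names, the `sqlite3` in-memory database and its three-column schema are stand-ins for whatever `self.pipeline.db` wraps, and `urlparse` is imported from `urllib.parse`, its documented home (the import from `urllib.request` in the diff works only because that module imports the name itself).

```python
import sqlite3
from urllib.parse import urlparse


def clean_host(url):
    """Return the host part of url without a leading 'www.'."""
    host = urlparse(url).netloc
    return host[4:] if host.startswith("www.") else host


def find_media(conn, media_name, media_url):
    """Look up media rows by name or host, passing values as bound parameters."""
    host = clean_host(media_url)
    query = "select * from wp_esi_media where name like ? or url like ?"
    # The wildcards live in the parameter values, not in the SQL text
    params = ("%{}%".format(media_name), "%{}%".format(host))
    return conn.execute(query, params).fetchall()


# Stand-in database; this schema is an assumption made for illustration only
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table wp_esi_media (id integer primary key, name text, url text)"
)
conn.execute(
    "insert into wp_esi_media (name, url) values (?, ?)",
    ("Example Times", "example.com"),
)

# A name with a quote is handled safely; the row is found through its host
rows = find_media(conn, "O'Reilly", "https://www.example.com/story/1")
```

The `?` placeholder is specific to `sqlite3`; common MySQL drivers use `%s` instead, so the placeholder style would need to match whichever driver the pipeline actually uses.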